Bazel: gRPC resource exhaustion errors during BEP upload

Created on 5 Sep 2020 · 9 Comments · Source: bazelbuild/bazel

Description of the problem:

Bazel reports gRPC resource exhaustion errors during the BEP upload when a very large build completes very quickly due to cache hits.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We often run into it when a very large build completes very quickly due to cache hits (and --bes_backend is specified).

What operating system are you running Bazel on?

Ubuntu 18.04

What's the output of bazel info release?

release 3.3.0

Have you found anything relevant by searching the web?

Similar issue encountered:

Any other information, logs, or outputs that you want to share?

Error logs:

ERROR: The Build Event Protocol upload failed: Not retrying publishBuildEvents, no more attempts left: status='Status{code=RESOURCE_EXHAUSTED, description=grpc: received message larger than max (10913322 vs. 4194304), cause=null}' RESOURCE_EXHAUSTED: RESOURCE_EXHAUSTED: grpc: received message larger than max (10913322 vs. 4194304) RESOURCE_EXHAUSTED: RESOURCE_EXHAUSTED: grpc: received message larger than max (10913322 vs. 4194304)

Bazel Exit Code:

38

Other Info:

  • We are using BuildBuddy as our BES backend.
  • Our current workaround is to disable BEP for the build in question.

Labels: team-Core, untriaged

All 9 comments

This actually looks like a buildbuddy bug a few folks have asked me about recently, not a Bazel issue.

Your BES upload is larger than the gRPC default of 4 MiB, hence the message you see here (you're sending about 10.4 MiB).
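
To make the numbers in the error message concrete, here is a small Go sketch that converts both byte counts to MiB. The constants are copied from the log above; the `toMiB` helper is a hypothetical name used only for illustration.

```go
package main

import "fmt"

// Values copied from the error message in this issue.
const (
	receivedBytes = 10913322 // size of the rejected message
	limitBytes    = 4194304  // limit enforced by the receiver
)

// toMiB converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
func toMiB(n int64) float64 {
	return float64(n) / (1024 * 1024)
}

func main() {
	fmt.Printf("limit:    %.2f MiB\n", toMiB(limitBytes))
	fmt.Printf("received: %.2f MiB\n", toMiB(receivedBytes))
	fmt.Printf("ratio:    %.2fx\n", float64(receivedBytes)/float64(limitBytes))
}
```

The limit works out to exactly 4 MiB and the rejected message to roughly 10.4 MiB, so a receiver limit of 20 MiB (1024*1024*20 bytes) would have accepted it.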

If you're using a fork of buildbuddy, which I assume you are, then you can go and fix this yourself like so:

diff --git a/server/libmain/libmain.go b/server/libmain/libmain.go
--- a/server/libmain/libmain.go
+++ b/server/libmain/libmain.go
@@ -211,6 +211,7 @@ func StartGRPCServiceOrDie(env environment.Env, buildBuddyServer *buildbuddy_ser
        grpcOptions := []grpc.ServerOption{
                rpcfilters.GetUnaryInterceptor(env),
                rpcfilters.GetStreamInterceptor(env),
+               grpc.MaxRecvMsgSize(1024*1024*20),
        }

If you'd like to take a look at alternative BES implementations, feel free to give us a shout :)

Hey @zachgrayio, thanks for your quick response. Is this something that gets configured on the server or on the client? I found this grpc-gateway comment and this Stack Overflow response suggesting it was a client-side config.

Maybe it needs to be configured in both places? I can't find references to the grpc MaxXMsgSize configs on the Bazel project though.

Also, someone ran into this issue with Buildbarn as well (granted, they could also have the bug if it's a server-side thing).

In this case it's a missing server option I think (grpc.MaxRecvMsgSize()).

Here's some background on this issue: gRPC has a built-in maximum message size controlled by the receiver (in this case the buildbuddy service). The default value in Java is 4 MiB.

Bazel does not automatically limit itself to the server-defined maximum message size. Doing so is difficult, as some of the proto messages in the Build Event Protocol / Service are inherently monolithic, and cannot be automatically broken into separate messages. As such, we were targeting a maximum size of about 50 MiB.

Depending on which event is too large, you may be able to reduce the event size by setting --bes_outerr_chunk_size or --build_event_max_named_set_of_file_entries on the client, i.e., Bazel. The default outerr chunk size is 1 MiB.
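
As a rough illustration of why a chunk size bounds the size of each event, here is a minimal Go sketch that splits captured output into pieces of at most a given size. The `chunk` helper is hypothetical and is not how Bazel implements `--bes_outerr_chunk_size`; it only shows the general idea, using the 1 MiB default mentioned above.

```go
package main

import "fmt"

// chunk splits data into consecutive pieces of at most maxSize bytes.
// Hypothetical helper for illustration; not Bazel's implementation.
func chunk(data []byte, maxSize int) [][]byte {
	if maxSize <= 0 {
		return nil
	}
	var chunks [][]byte
	for len(data) > 0 {
		n := maxSize
		if len(data) < n {
			n = len(data)
		}
		chunks = append(chunks, data[:n])
		data = data[n:]
	}
	return chunks
}

func main() {
	const mib = 1024 * 1024
	// 2.5 MiB of pretend stdout/stderr.
	output := make([]byte, 2*mib+512*1024)
	for i, c := range chunk(output, mib) {
		fmt.Printf("chunk %d: %d bytes\n", i, len(c))
	}
}
```

With a 1 MiB limit, 2.5 MiB of output becomes three pieces, none larger than 1 MiB, so no single piece can exceed a 4 MiB receiver limit on its own.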

Unfortunately, the error message above contains the error code, but not which message caused it.

Or set --legacy_important_outputs=false.

Hey @ulfjack thanks for the tips.

> we were targeting a maximum size of about 50 MiB.

Do you know where this is specified? Mostly just curious.

> Depending on which event is too large, you may be able to reduce the event size by setting --bes_outerr_chunk_size or --build_event_max_named_set_of_file_entries on the client

Interesting. It might make more sense to increase it on the receiver if the client's already allowing 50 MiB. We might look into that approach.

> Or set --legacy_important_outputs=false

I'm curious, how come this might also help? The docs for legacy_important_outputs simply say "Use this to suppress generation of the legacy important_outputs field in the TargetComplete event"

I am not aware of any place where that is publicly documented. This is my personal recollection from working on the BEP.

I think it makes sense for BES implementations to provide a knob to allow larger than default packets. However, there are also reasons for preferring smaller packets (e.g., preventing service outages due to memory exhaustion), and there are knobs in Bazel to adjust that as well.
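
A server-side knob of this kind could look roughly like the Go sketch below. This is an assumption-laden illustration, not BuildBuddy's actual configuration: the flag name `max_recv_msg_bytes`, the flag set name, and the `parseMaxRecvBytes` helper are all hypothetical. It uses only the standard library; a real server would pass the parsed value to `grpc.MaxRecvMsgSize`, the option shown in the patch earlier in this thread.

```go
package main

import (
	"flag"
	"fmt"
)

// defaultMaxRecvBytes mirrors the 4 MiB gRPC default discussed in this thread.
const defaultMaxRecvBytes = 4 * 1024 * 1024

// parseMaxRecvBytes reads a hypothetical -max_recv_msg_bytes flag from args
// and rejects non-positive values.
func parseMaxRecvBytes(args []string) (int, error) {
	fs := flag.NewFlagSet("bes-server", flag.ContinueOnError)
	limit := fs.Int("max_recv_msg_bytes", defaultMaxRecvBytes,
		"maximum size in bytes of a received gRPC message")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *limit <= 0 {
		return 0, fmt.Errorf("max_recv_msg_bytes must be positive, got %d", *limit)
	}
	return *limit, nil
}

func main() {
	// 52428800 bytes = 50 MiB, the target size mentioned earlier in the thread.
	limit, err := parseMaxRecvBytes([]string{"-max_recv_msg_bytes=52428800"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// A real server would pass this value to grpc.MaxRecvMsgSize(limit).
	fmt.Println("max receive size:", limit)
}
```

Keeping the default at 4 MiB and requiring operators to opt in to a larger value preserves the memory-exhaustion protection for deployments that do not need large events.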

The original BEP design had a repeated field representing a flat list of all 'important' outputs of a configured target. However, this turned out to be problematic because some configured targets have a huge list of such outputs. We then migrated to a nested-set style listing of important outputs. However, this is technically an incompatible change, and so we added the --legacy_important_outputs flag. Maybe we should have called it incompatible_legacy_outputs or something.
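
The size difference between the two styles can be sketched in Go as follows. This is only an illustration under assumed names: `fileSet` and `flatten` are hypothetical and do not correspond to the actual BEP proto messages. The point is that a flat list repeats every shared file in each target's event, while a nested set lists the shared files once and refers to them.

```go
package main

import "fmt"

// fileSet is a hypothetical nested-set node: some direct files plus
// references to other sets that may be shared between targets.
type fileSet struct {
	files    []string
	children []*fileSet
}

// flatten expands a set into a flat list, as a flat repeated field would.
func flatten(s *fileSet) []string {
	out := append([]string{}, s.files...)
	for _, c := range s.children {
		out = append(out, flatten(c)...)
	}
	return out
}

func main() {
	// 1000 files shared by two targets.
	shared := &fileSet{}
	for i := 0; i < 1000; i++ {
		shared.files = append(shared.files, fmt.Sprintf("lib/file%d.o", i))
	}
	targetA := &fileSet{files: []string{"a.bin"}, children: []*fileSet{shared}}
	targetB := &fileSet{files: []string{"b.bin"}, children: []*fileSet{shared}}

	// Flat style: each target's event repeats every shared file.
	flatEntries := len(flatten(targetA)) + len(flatten(targetB))
	// Nested style: the shared set is listed once and referenced twice.
	nestedEntries := len(shared.files) + len(targetA.files) + len(targetB.files)

	fmt.Println("flat entries:  ", flatEntries)   // 2002
	fmt.Println("nested entries:", nestedEntries) // 1002
}
```

In the flat style a single target's event also carries its whole list in one message, which is how one event can grow past the receiver's limit.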

Thanks for reporting @SrodriguezO and for the background @ulfjack.

We've bumped the default max grpc limit in https://github.com/buildbuddy-io/buildbuddy/commit/7cd6929ac9663439d63d74ab201fca72cc3b9054 which should go live in the next release (targeting this afternoon). Configurability incoming as well.

Feel free to upstream your changes in the future @zachgrayio!

Closing this as it doesn't seem to be a Bazel bug after all. Thank you @siggisim for the quick turnaround!
