Bazel CI: RBE builds are broken after grpc java upgrade

Created on 14 Oct 2020  ·  23 Comments  ·  Source: bazelbuild/bazel

https://buildkite.com/bazel/bazel-auto-sheriff-face-with-cowboy-hat/builds/306

ERROR: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/cfad747ece6c2992c5b867a14a43555e/external/org_golang_x_crypto/curve25519/BUILD.bazel:3:11: GoCompilePkg external/org_golang_x_crypto/curve25519/curve25519.a failed (Exit 34): com.google.devtools.build.lib.remote.BulkTransferException
    at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:225)
    at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:331)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:486)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:306)
    at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:102)
    at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
    at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
    at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:859)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1019)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:978)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:469)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:845)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:314)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
    Suppressed: java.io.IOException: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Bandwidth exhausted
HTTP/2 error code: ENHANCE_YOUR_CALM
Received Goaway
too_many_pings

Verified by building with Bazel@d4cd4e7ab18ebeae4152dafc113367289ffebb12 and its previous commit:
https://buildkite.com/bazel/culprit-finder/builds/581
https://buildkite.com/bazel/culprit-finder/builds/582

Culprit: d4cd4e7ab18ebeae4152dafc113367289ffebb12

P1 breakage team-Remote-Exec bug

All 23 comments

FYI @dmivankov, can you help look into this issue? Is this a bug in protobuf?

https://source.corp.google.com/piper///depot/google3/third_party/bazel/src/main/java/com/google/devtools/build/lib/authandtls/AuthAndTLSOptions.java;l=119?q=%22grpcKeepaliveTime%22
Maybe we should also change the grpcKeepaliveTime value here?

/cc @coeuvre @buchgr

Interestingly I couldn't immediately find keep-alive default timeout changes in https://github.com/grpc/grpc-java v1.26.0->v1.31.1 or in https://github.com/netty/netty netty-4.1.42.Final -> netty-4.1.48.Final

20sec in GrpcUtil
https://github.com/grpc/grpc-java/blob/v1.31.1/core/src/main/java/io/grpc/internal/GrpcUtil.java#L205
vs
https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/GrpcUtil.java#L203

But this looks interesting
https://github.com/grpc/grpc-java/pull/7015 auto flow control was turned on between 1.26.0 and 1.31.1
https://github.com/grpc/grpc-java/pull/7302 users reported problems

Does setting the environment variable

GRPC_EXPERIMENTAL_AUTOFLOWCONTROL=false

fix the issue on the client side?
If yes (or maybe even if it doesn't?), we can try

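NettyServerBuilder builder;  // obtained wherever the server is configured
...
// Setting an explicit window size disables auto flow control tuning.
builder.flowControlWindow(NettyServerBuilder.DEFAULT_FLOW_CONTROL_WINDOW);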

in https://github.com/bazelbuild/bazel/search?q=NettyServerBuilder
Note: there seems to be no method to directly disable auto flow control other than manually setting the default window: https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyServerBuilder.java#L387
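
A rough sketch of how an environment-variable toggle like GRPC_EXPERIMENTAL_AUTOFLOWCONTROL could be read on the Java side. This is not grpc-java's code: the class and method names are hypothetical, and treating an unset variable as "enabled" is an assumption based on the remark in this thread that v1.31.1 turns auto flow control on by default.

```java
import java.util.Map;

// Hypothetical helper, for illustration only (not part of grpc-java or Bazel).
public class AutoFlowControlToggle {
    static final String ENV_NAME = "GRPC_EXPERIMENTAL_AUTOFLOWCONTROL";

    // Assumption: an unset variable means "enabled"; any value other than
    // "true" (case-insensitive) disables the feature.
    static boolean autoFlowControlEnabled(Map<String, String> env) {
        String value = env.get(ENV_NAME);
        if (value == null) {
            return true;
        }
        return Boolean.parseBoolean(value);
    }

    public static void main(String[] args) {
        System.out.println(
            "auto flow control enabled: " + autoFlowControlEnabled(System.getenv()));
    }
}
```

Passing the environment in as a map keeps the check easy to exercise without changing the real process environment.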

Drafted a PR https://github.com/bazelbuild/bazel/pull/12266
Disclaimer: I neither tested it myself nor have reproduced the error in question

We can also try a forward fix going for v1.32.2
https://github.com/grpc/grpc-java/releases/tag/v1.32.2

netty: BDP ping accounting should occur after flow control. This resolves an incompatibility issue introduced in v1.30.0 and could be worked around via GRPC_EXPERIMENTAL_AUTOFLOWCONTROL=false introduced later. The symptom was a GOAWAY with “too_many_pings” without an aggressive keepalive configured. The environment variable is still available, but will be removed in the future

https://github.com/grpc/grpc-java/pull/7446
https://github.com/grpc/grpc-java/pull/7503

The original default keepalive was to send no keepalive. Maybe that has also changed upstream? That seems more likely to explain the problem.

Since gRPC v1.32.2 fixes this, can we upgrade to that version instead?

Yes, auto flow control enables pinging.
https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/AbstractNettyHandler.java#L141 is where auto flow pinging gets enabled in v1.26.0 (same in v1.31.1, but v1.31.1 enables auto flow control by default for both client and server)
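
To illustrate why extra pings can end in a GOAWAY with "too_many_pings", here is a toy model of a server that counts pings arriving too close together. It is only a sketch under stated assumptions: the class name, the strike counting, and the thresholds are hypothetical and are not taken from gRPC's implementation.

```java
// Toy model, for illustration only: all names and thresholds are hypothetical.
public class PingStrikeModel {
    private final long minIntervalNanos; // assumed minimum spacing between pings
    private final int maxStrikes;        // assumed number of tolerated violations
    private long lastPingNanos;
    private boolean seenPing;
    private int strikes;

    PingStrikeModel(long minIntervalNanos, int maxStrikes) {
        this.minIntervalNanos = minIntervalNanos;
        this.maxStrikes = maxStrikes;
    }

    // Returns false once the model would close the connection (GOAWAY).
    boolean onPing(long nowNanos) {
        if (seenPing && nowNanos - lastPingNanos < minIntervalNanos) {
            strikes++; // ping arrived sooner than the allowed interval
        }
        seenPing = true;
        lastPingNanos = nowNanos;
        return strikes <= maxStrikes;
    }

    // Assumption: data activity forgives earlier violations.
    void onDataSent() {
        strikes = 0;
    }
}
```

In this model, a client that sends additional pings for flow-control measurement on top of keepalive pings can exceed the tolerated number of violations even when its keepalive settings alone are not aggressive.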

Given that auto flow control is a new feature and there's some indication that it caused the regression, I'd rather try disabling it first (https://github.com/bazelbuild/bazel/pull/12266) as the more solid option.

v1.32.2 has fixes in that area, but bumping again takes more PRs. Unless there's an easy way to check whether it really helps before merging, it's probably a good idea to try the faster fix first.

I will prepare v1.32.2 though

Interesting bit in grpc core 1.32.0:
https://github.com/grpc/grpc/pull/23313 fixing https://github.com/grpc/grpc/issues/16210
Hopefully it also improves the situation rather than introducing new bugs in the same area :)

@meteorcloudy The fix is merged as 6e94b05ac93a0ff2674a1ff3d8743481aaed4eed. Can you run the tests with it?

The issue still exists for rules_go. https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1701#41043323-056e-4542-8e18-101aecd887dc

Could the rules_go failure be another (flaky) issue?

I don't think it's a flaky issue: there are other projects failing with the same error, and a rerun didn't fix them.

OK, so next we can try one of two things:

  • go back to the 1.26.0 jars
  • go for 1.32.x https://github.com/bazelbuild/bazel/pull/12273 (consists of 3 commits that should be ready to go as split PRs). If it works, wait some time and maybe try re-enabling the auto flow control feature (optional).

Is there a way to run those RBE tests on PRs?

Yes, I'll help test #12273

The RBE build seems to be fixed by upgrading to 1.32.x.
Another test with auto flow control feature enabled: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1703

The rules_haskell failure is caused by something else. So it looks like upgrading grpc to 1.32.x does fix the issue and allows us to safely bring back auto flow control.

Great, then I can make the PRs: add 1.32.x, switch to 1.32.x & bring auto flow control back, drop 1.31.1

This is fixed by upgrading the grpc java version.
