https://buildkite.com/bazel/bazel-auto-sheriff-face-with-cowboy-hat/builds/306
ERROR: /var/lib/buildkite-agent/.cache/bazel/_bazel_buildkite-agent/cfad747ece6c2992c5b867a14a43555e/external/org_golang_x_crypto/curve25519/BUILD.bazel:3:11: GoCompilePkg external/org_golang_x_crypto/curve25519/curve25519.a failed (Exit 34): com.google.devtools.build.lib.remote.BulkTransferException
    at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:225)
    at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:331)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:486)
    at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:306)
    at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:134)
    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:102)
    at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
    at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)
    at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:859)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:1019)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:978)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:129)
    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:81)
    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:469)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:845)
    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:314)
    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)
    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
    Suppressed: java.io.IOException: io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: Bandwidth exhausted
HTTP/2 error code: ENHANCE_YOUR_CALM
Received Goaway
too_many_pings
Verified by building with Bazel@d4cd4e7ab18ebeae4152dafc113367289ffebb12 and its previous commit:
https://buildkite.com/bazel/culprit-finder/builds/581
https://buildkite.com/bazel/culprit-finder/builds/582
Culprit: d4cd4e7ab18ebeae4152dafc113367289ffebb12
FYI @dmivankov, can you help look into this issue? Is this a bug in protobuf?
Similar issue and fix: https://github.com/pravega/pravega/issues/1621, https://github.com/pravega/pravega/pull/1622
https://source.corp.google.com/piper///depot/google3/third_party/bazel/src/main/java/com/google/devtools/build/lib/authandtls/AuthAndTLSOptions.java;l=119?q=%22grpcKeepaliveTime%22
Maybe we should also change the grpcKeepaliveTime value here?
/cc @coeuvre @buchgr
Interestingly I couldn't immediately find keep-alive default timeout changes in https://github.com/grpc/grpc-java v1.26.0->v1.31.1 or in https://github.com/netty/netty netty-4.1.42.Final -> netty-4.1.48.Final
The default is 20 seconds in GrpcUtil in both versions:
https://github.com/grpc/grpc-java/blob/v1.31.1/core/src/main/java/io/grpc/internal/GrpcUtil.java#L205
vs
https://github.com/grpc/grpc-java/blob/v1.26.0/core/src/main/java/io/grpc/internal/GrpcUtil.java#L203
But this looks interesting
https://github.com/grpc/grpc-java/pull/7015 auto flow control was turned on between 1.26.0 and 1.31.1
https://github.com/grpc/grpc-java/pull/7302 users reported problems
Does setting the environment variable
GRPC_EXPERIMENTAL_AUTOFLOWCONTROL=false
fix the issue on client side?
If yes (or maybe even if it doesn't?), we can try
NettyServerBuilder builder;
..
// Setting an explicit window disables auto flow control tuning for this builder.
builder.flowControlWindow(NettyServerBuilder.DEFAULT_FLOW_CONTROL_WINDOW);
in https://github.com/bazelbuild/bazel/search?q=NettyServerBuilder
Note: there seems to be no way to disable auto flow control directly, other than manually setting the default window: https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyServerBuilder.java#L387
Drafted a PR https://github.com/bazelbuild/bazel/pull/12266
Disclaimer: I have neither tested it myself nor reproduced the error in question.
We can also try a forward fix by going to v1.32.2:
https://github.com/grpc/grpc-java/releases/tag/v1.32.2
netty: BDP ping accounting should occur after flow control. This resolves an incompatibility issue introduced in v1.30.0 and could be worked around via GRPC_EXPERIMENTAL_AUTOFLOWCONTROL=false introduced later. The symptom was a GOAWAY with “too_many_pings” without an aggressive keepalive configured. The environment variable is still available, but will be removed in the future
https://github.com/grpc/grpc-java/pull/7446
https://github.com/grpc/grpc-java/pull/7503
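For anyone unfamiliar with the symptom, here is a rough sketch of how a server can end up answering with a GOAWAY and "too_many_pings". It is a simplified, stdlib-only model, not grpc-java's actual code: the class name PingStrikeModel, the strike limit of 2, and the 5-minute minimum ping interval are all assumptions chosen for illustration. The idea is that pings arriving sooner than the server's minimum interval count as strikes, and too many strikes close the connection.

```java
import java.util.concurrent.TimeUnit;

/**
 * Simplified, hypothetical model of server-side ping enforcement.
 * Not grpc-java's implementation; names and thresholds are illustrative.
 */
public class PingStrikeModel {
    // Assumed limit: how many too-early pings are tolerated before GOAWAY.
    public static final int MAX_PING_STRIKES = 2;

    private final long minIntervalNanos;
    private long lastValidPingNanos;
    private int strikes;

    public PingStrikeModel(long minInterval, TimeUnit unit, long startNanos) {
        this.minIntervalNanos = unit.toNanos(minInterval);
        this.lastValidPingNanos = startNanos;
    }

    /** Returns false when the model would close the connection with too_many_pings. */
    public boolean pingAcceptable(long nowNanos) {
        if (nowNanos - lastValidPingNanos >= minIntervalNanos) {
            // Ping arrived late enough: accept it and restart the interval.
            lastValidPingNanos = nowNanos;
            return true;
        }
        // Ping arrived too early: record a strike.
        strikes++;
        return strikes <= MAX_PING_STRIKES;
    }

    /** In this model, sending data forgives earlier strikes. */
    public void onDataSent() {
        strikes = 0;
    }

    public static void main(String[] args) {
        PingStrikeModel server = new PingStrikeModel(5, TimeUnit.MINUTES, 0L);
        long oneSecond = TimeUnit.SECONDS.toNanos(1);
        // A client pinging once per second, far faster than the allowed interval.
        for (int i = 1; i <= 4; i++) {
            boolean ok = server.pingAcceptable(i * oneSecond);
            System.out.println("ping " + i + ": " + (ok ? "accepted" : "GOAWAY too_many_pings"));
        }
    }
}
```

In this model the first two early pings are tolerated and the third is rejected. That would be consistent with the release note above: if flow-control pings are counted before the corresponding data is accounted for, a client can accumulate strikes even without an aggressive keepalive configured.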
The original default keepalive was to send no keepalive. Maybe that has also changed upstream? That seems more likely to explain the problem.
Since gRPC v1.32.2 fixes this, can we upgrade to that version instead?
Yes, auto flow control enables pinging.
https://github.com/grpc/grpc-java/blob/v1.26.0/netty/src/main/java/io/grpc/netty/AbstractNettyHandler.java#L141 - this is where auto flow control pinging gets enabled in v1.26.0 (same in v1.31.1, but v1.31.1 enables auto flow control by default for both client and server).
Given that auto flow control is a new feature and there's some indication that it caused the regression, I'd rather try disabling it first (https://github.com/bazelbuild/bazel/pull/12266) as the more solid option.
v1.32.2 has fixes in that area, but bumping the version again takes more PRs. Unless there's an easy way to check whether it really helps before merging, it's probably a good idea to try the faster fix first.
I will prepare the v1.32.2 upgrade as well, though.
Interesting bit in grpc core 1.32.0:
https://github.com/grpc/grpc/pull/23313 fixing https://github.com/grpc/grpc/issues/16210
Hopefully it also improves the situation rather than introducing new bugs in the same area :)
@meteorcloudy The fix is merged as 6e94b05ac93a0ff2674a1ff3d8743481aaed4eed. Can you run the tests with it?
Just launched a downstream test here: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1701
The issue still exists for rules_go. https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1701#41043323-056e-4542-8e18-101aecd887dc
That is interesting. There has to be some mechanism other than auto flow control that is issuing the pings,
probably via https://github.com/grpc/grpc-java/blob/2fca42feb93f1bda1a80f6649d1e6304e9a67b08/core/src/main/java/io/grpc/internal/ClientTransport.java#L65 and likely by https://github.com/grpc/grpc-java/blob/2fca42feb93f1bda1a80f6649d1e6304e9a67b08/core/src/main/java/io/grpc/internal/KeepAliveManager.java
Could the rules_go failure be a different (flaky) issue?
I don't think it's a flaky issue: other projects are failing with the same error, and rerunning didn't fix them.
OK, so next we can try one of two things
Is there a way to run those RBE tests on PRs?
Yes, I'll help test #12273
The RBE build seems to be fixed by upgrading to 1.32.x.
Another test with the auto flow control feature enabled: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/1703
The rules_haskell failure is caused by something else. So it looks like upgrading grpc to 1.32.x does fix the issue and allows us to safely bring back auto flow control.
Great, then I can make the PRs: add 1.32.x, switch to 1.32.x & bring auto flow control back, drop 1.31.1
This is fixed by upgrading the grpc-java version.