Bazel: Bazel hangs indefinitely when building with remote cache

Created on 16 Jul 2020 · 71 comments · Source: bazelbuild/bazel

We intermittently see an issue where bazel will hang indefinitely while building a target:

[5,322 / 5,621] Compiling Swift module SomeTarget; 11111s remote-cache

This seems to happen while fetching from the remote cache. In these cases, once we kill the bazel process, no execution logs or Chrome trace logs are produced, so I don't have that info, but when I sample the process on macOS or send it kill -QUIT PID, I get these logs:

sample log:

Call graph:
    2796 Thread_77950543   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2796 start  (in libdyld.dylib) + 1  [0x7fff736f0cc9]
    +   2796 main  (in bazel) + 143  [0x106f197ff]
    +     2796 blaze::Main(int, char const* const*, blaze::WorkspaceLayout*, blaze::OptionProcessor*, unsigned long long)  (in bazel) + 5317  [0x106f012a5]
    +       2796 blaze::RunLauncher(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::StartupOptions const&, blaze::OptionProcessor const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::LoggingInfo*)  (in bazel) + 11470  [0x106f0477e]
    +         2796 blaze::RunClientServerMode(blaze_util::Path const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, blaze_util::Path const&, blaze::WorkspaceLayout const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, blaze::OptionProcessor const&, blaze::StartupOptions const&, blaze::LoggingInfo*, blaze::DurationMillis, blaze::DurationMillis, blaze::BlazeServer*)  (in bazel) + 4966  [0x106f113d6]
    +           2796 blaze::BlazeServer::Communicate(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<blaze::RcStartupFlag, std::__1::allocator<blaze::RcStartupFlag> > const&, blaze::LoggingInfo const&, blaze::DurationMillis, blaze::DurationMillis, blaze::DurationMillis)  (in bazel) + 1632  [0x106f09350]
    +             2796 grpc::ClientReader<command_server::RunResponse>::Read(command_server::RunResponse*)  (in bazel) + 496  [0x106f086f0]
    +               2796 cq_pluck(grpc_completion_queue*, void*, gpr_timespec, void*)  (in bazel) + 548  [0x106fda5e4]
    +                 2796 pollset_work(grpc_pollset*, grpc_pollset_worker**, long long)  (in bazel) + 1303  [0x106fb8fc7]
    +                   2796 poll  (in libsystem_kernel.dylib) + 10  [0x7fff738383d6]
    2796 Thread_77950545
    + 2796 start_wqthread  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b77]
    +   2796 _pthread_wqthread  (in libsystem_pthread.dylib) + 390  [0x7fff738f1aa1]
    +     2796 __workq_kernreturn  (in libsystem_kernel.dylib) + 10  [0x7fff738334ce]
    2796 Thread_77950552: default-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950553: resolver-executor
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 GrpcExecutor::ThreadMain(void*)  (in bazel) + 550  [0x106fbce36]
    +         2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950554: grpc_global_timer
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
    +       2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
    +         2796 gpr_cv_wait  (in bazel) + 102  [0x106fe8756]
    +           2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
    +             2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]
    2796 Thread_77950566
    + 2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
    +   2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
    +     2796 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (blaze::BlazeServer::*)(), blaze::BlazeServer*> >(void*)  (in bazel) + 62  [0x106f1950e]
    +       2796 blaze::BlazeServer::CancelThread()  (in bazel) + 152  [0x106f073d8]
    +         2796 blaze_util::PosixPipe::Receive(void*, int, int*)  (in bazel) + 25  [0x1070ff1f9]
    +           2796 read  (in libsystem_kernel.dylib) + 10  [0x7fff7383281e]
    2796 Thread_77950583: grpc_global_timer
      2796 thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]
        2796 _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
          2796 grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix(char const*, void (*)(void*), void*, bool*)::'lambda'(void*)::__invoke(void*)  (in bazel) + 125  [0x106fe9edd]
            2796 timer_thread(void*)  (in bazel) + 895  [0x106fc9dff]
              2796 gpr_cv_wait  (in bazel) + 120  [0x106fe8768]
                2796 _pthread_cond_wait  (in libsystem_pthread.dylib) + 698  [0x7fff738f5425]
                  2796 __psynch_cvwait  (in libsystem_kernel.dylib) + 10  [0x7fff73834882]

Total number in stack (recursive counted multiple, when >=5):
        5       _pthread_start  (in libsystem_pthread.dylib) + 148  [0x7fff738f5109]
        5       thread_start  (in libsystem_pthread.dylib) + 15  [0x7fff738f0b8b]

Sort by top of stack, same collapsed (when >= 5):
        __psynch_cvwait  (in libsystem_kernel.dylib)        11184
        __workq_kernreturn  (in libsystem_kernel.dylib)        2796
        poll  (in libsystem_kernel.dylib)        2796
        read  (in libsystem_kernel.dylib)        2796

jvm.out produced by sigquit https://gist.github.com/keith/3c77f7e49c108964596440a251c05929

It's worth noting that during this time bazel is consuming 0% CPU, so it seems likely that it's blocked waiting on a response that has already timed out. Currently we're using a gRPC remote cache, with the default --remote_timeout of 60 seconds, and the default value for --cpus.

What operating system are you running Bazel on?

macOS

What's the output of bazel info release?

3.3.0. We've heard reports from others that this repros on 3.2.0 as well. We have since updated to 3.4.1, but we don't hit this often enough to know yet whether it's an issue there too.

Have you found anything relevant by searching the web?

There are various related-sounding issues, but AFAICT they're all closed.

P1 team-Remote-Exec bug

All 71 comments

We see this issue as well using an HTTP remote cache on bazel 3.2.0

Cc @coeuvre

I assume this is the stack trace of the blocking thread:

"skyframe-evaluator 12": awaiting notification on [0x0000000645000ca0]

    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)

    at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)

    at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)

    at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)

    at com.google.devtools.build.lib.remote.util.Utils.getFromFuture(Utils.java:56)

    at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:219)

    at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:331)

    at com.google.devtools.build.lib.remote.RemoteSpawnCache.lookup(RemoteSpawnCache.java:179)

    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:123)

    at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:96)

    at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:39)

    at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:62)

    at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:331)

    at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)

    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:779)

    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:925)

    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:896)

    at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:128)

    at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:80)

    at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:419)

    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:897)

    at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:300)

    at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:438)

    at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:398)

    at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/Unknown Source)

    at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/Unknown Source)

    at java.lang.Thread.run([email protected]/Unknown Source)

Currently, Bazel only triggers a timeout error when no bytes are downloaded/uploaded within --remote_timeout seconds.

In this case, I think it is still downloading bytes from the remote cache, but maybe at a very slow network speed, so the timeout never triggers and it hangs for a long time.

Should we change this so a timeout error is triggered if the download/upload hasn't finished within --remote_timeout seconds?

FWIW, in our case the targets we've seen this happen with are definitely <10 MB, but I suppose there could be an intermittent issue with network performance. I'm definitely a +1 for changing --remote_timeout to time out when the request hasn't completed in that time. Even if we don't want to change the default behavior, there should be some way to opt into it, since otherwise these hangs seem to last "forever".

Bazel only triggers a timeout error when no bytes are downloaded/uploaded within --remote_timeout seconds.

This is only true for the HTTP remote cache.

For the gRPC remote cache, Bazel does set a timeout via withDeadlineAfter and will retry up to --remote_retries times, which defaults to 5.

In theory, the max waiting time for downloading a file/blob from the cache should be 60s * (1 + 5) = 360s by default. There must be something wrong in this case.
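To illustrate that bound, here is a minimal sketch (not Bazel's actual Retrier; the helper name and retry codes are illustrative) of how a per-attempt gRPC deadline combined with a capped retry count limits the total wait to roughly deadline * (1 + retries):

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.Callable;

public final class BoundedRetry {
  // Runs `call` at most 1 + maxRetries times. The callable is expected to apply
  // a per-attempt deadline, e.g. stub.withDeadlineAfter(60, TimeUnit.SECONDS),
  // so with the defaults above the worst case is 60s * (1 + 5) = 360s.
  public static <T> T execute(Callable<T> call, int maxRetries) throws Exception {
    StatusRuntimeException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.call();
      } catch (StatusRuntimeException e) {
        Status.Code code = e.getStatus().getCode();
        // Retry only transient failures; surface everything else immediately.
        if (code != Status.Code.DEADLINE_EXCEEDED && code != Status.Code.UNAVAILABLE) {
          throw e;
        }
        last = e;
      }
    }
    throw last;  // All attempts timed out or were unavailable.
  }

  private BoundedRetry() {}
}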

Interesting. Is your remote cache behind a load balancer? If so, what kind?

We're using google RBE as a cache

I've seen it happen with an HTTP cache behind AWS ELB, and a gRPC cache not sitting behind anything. I can try with an unbalanced HTTP cache if that would be helpful?

Let me repost part of my PR description that may be relevant here:

We have seen cases where GCP's external TCP load balancer silently drops connections without telling the client. If this happens when the client is waiting for replies from the server (e.g., when all worker threads are in blocking 'execute' calls), then the client does not notice the connection loss and hangs.

In our testing, the client unblocked after ~2 hours without this change, although we are not sure why (maybe a default Linux Kernel timeout on TCP connections?). With this flag set to 10 minutes, and running builds over night, we see repeated 10-minute gaps in remote service utilization, which seems pretty clear evidence that this is the underlying problem.

While we think it's due to GCP's TCP load balancer, we technically can't rule out any other system on the network between my machine and the cluster. What is surprising to us is that we have seen it happen in the middle of the build. I let some builds run through last night, and we have clear evidence of ~10 hangs during that period, all of which recovered after exactly 10 minutes, with 10 minutes being exactly the keep-alive time I set on Bazel.

[chart: used-executors-keepalive]

We think that some of the other spikes could also be caused by the same issue, just that Bazel happened to recover faster because it tried to make a remote call rather than just wait for replies.

I think the best case would be to get my PR merged, and then test with specific settings for the timeout and, ideally, monitor the length of the gaps and see if there's a correlation. I think that most clearly shows whether it's a similar problem or a different one.

What I did to debug this was to take heap dumps from Bazel and the server and analyze them with YourKit Profiler. YourKit supports searching for and inspecting individual objects, so I could clearly see that Bazel had an open connection and the server did not have an open connection. This asymmetry tipped me off that it was something in the network.

Ulf, please see https://docs.google.com/document/d/13vXJTFHJX_hnYUxKdGQ2puxbNM3uJdZcEO5oNGP-Nnc/edit#heading=h.z3810n36by5c - Jakob was investigating keepalives last year and concluded that gRPC keepalives were unworkable because there's also a limit on the number of consecutive pings, where that limit is "2" in practice.

I'm generally inclined to call this a bug - we (mostly Dave and Chi) have been chasing this steadily and have observed that the outstanding RPC can hang for longer than the RPC timeout, which is fairly clearly either a gRPC or a bazel bug since gRPC _should_ have surfaced a timeout by then if nothing else. But the fact that it's also affecting HTTP caches could mean it's a bazel bug? (Or a Netty bug, if both use netty?) Either way, there's some concurrency problem where expected signals (RPC failures, timeouts) are not getting handled properly. And if we believe that timeouts are being mishandled and that it may be in the gRPC/netty/bazel stack somewhere, it may also be the case that connection terminations should be noticed but are being mishandled by the same bug.

AIUI, Bazel currently doesn't set a timeout on execute calls. The proposal to cancel the call and retry is not feasible IMO as it doesn't allow distinguishing between actual cancellation and keep-alive cancellation. If an action is really expensive, letting it run to completion when nobody is going to use the result seems unacceptable.

We could send app-level keep-alives from the server, but I don't see how that's better than client-side pings. It's actually worse because it'll require more network bandwidth (client-side pings are per connection rather than per call). Did anyone discuss this with the gRPC team? Because my research seemed to indicate that they added the client-side keep-alives for this purpose.

Also, we need a workaround now-ish, because the current state of affairs is that remote execution is literally unusable in some not uncommon scenarios.

Ideally, someone would fix the load balancers to send a TCP reset instead of us working around the issue, but I won't hold my breath.

Based on my professional opinion that that would be a completely absurd way to specify keep-alive ping behavior given what the purpose of a keep-alive ping is, I tried it with the gRPC Java implementation, and I can confirm that it sends PING packets at regular intervals (not just 2) and the server keeps replying to them, as I would expect. I also observe that gRPC seems to enforce an undocumented 10s minimum ping interval regardless of what's specified on the server-side (why???).

Screenshot of Wireshark which clearly shows 10s ping intervals:
[screenshot: wireshark-grpc-ping]
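For reference, here is a minimal sketch of the kind of client-side keepalive configuration exercised in this test, assuming grpc-java with the Netty transport (the endpoint and interval values are illustrative):

import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveClient {
  public static void main(String[] args) {
    ManagedChannel channel = NettyChannelBuilder
        .forAddress("remote-cache.example.com", 443)  // illustrative endpoint
        // Send an HTTP/2 PING whenever the transport has been quiet this long.
        // grpc-java enforces a 10s minimum, matching the Wireshark capture.
        .keepAliveTime(10, TimeUnit.SECONDS)
        // Consider the connection dead if the PING isn't acked in time.
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        // Keep pinging even with no RPC in flight; the server must permit this
        // or it will respond with GOAWAY (too_many_pings).
        .keepAliveWithoutCalls(true)
        .useTransportSecurity()
        .build();
    // ... create stubs against `channel` and issue calls as usual ...
    channel.shutdownNow();
  }
}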

I also tried to ping the server more aggressively than the server allows, and that does seem to result in a GOAWAY, as expected (based on my assumption that the implementation is dumb). Ideally, the server would tell the client about the minimum allowed keep-alive time rather than having to do that manually.

There's also @ejona86's comment on https://github.com/grpc/grpc-java/issues/7237 which seems to indicate that the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting may be misunderstood. I was not able to find more useful documentation about the setting - note that Java does not have such a setting.

My conclusion is that @buchgr's conclusion in the doc is incorrect, at least for the Java gRPC library. I have not tried it with other language implementations, but I would be surprised if they broke keep-alive pings in this way.

It is unclear from the doc whether @buchgr tested the actual behavior or came to the conclusion purely based on reading the docs. Also, after reading the comments more carefully, it looks like both @EricBurnett and @bergsieker expressed incredulity at gRPC keep-alive being defined that way, and @buchgr said he sent an email to the mailing list but never reported back.

I can't find the email @buchgr mentioned, but all discussions about keep-alive pings on the main gRPC mailing list seem to indicate that pings work exactly the way I would expect rather than having some random limit on how many pings can be sent by the client.

Apologies for the many posts, but this is really important for us, because, as mentioned, this makes remote execution unusable in some scenarios, and I'd like to have a workaround / fix in sooner rather than later.

As part of investigating https://github.com/grpc/grpc/issues/17667 (not very useful of a reference) I discovered that, yes, C did apply the GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA setting to keepalive pings. And yes, that is absurd. It was "working as expected" by the C core team though, so any discussions with them probably did not lead anywhere. Bringing it to my attention at that point though would have helped. There are legitimate reasons for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA related to BDP monitoring, but it caused problems to expose the option and the option should not have impacted keepalive. The very existence of the option is what was causing problems here, in a "tail wagging the dog" way. They have carved out some time to fix the implementation; it's not trivial for a variety of reasons.

Java should behave as you expect. The 10 second minimum is defined in https://github.com/grpc/proposal/blob/master/A8-client-side-keepalive.md . The Java documentation states that low values may be increased, but it does not define 10 seconds specifically. Very recently Java did receive something similar to GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA internally, but it isn't exposed as an option and it does not impact keepalive.

When gRPC clients receive a GOAWAY with too_many_pings they should increase their keepalive time by 2x. This is to reduce the impact when things are misconfigured, while not removing errors entirely so the user can be informed of the issue, as they wouldn't be getting their expected keepalive setting.

It sounds like you are saying a TCP LB is being used (not L7).

You should be aware there is a 10-minute idle timeout: https://cloud.google.com/compute/docs/troubleshooting/general-tips#communicatewithinternet . But it doesn't seem you are hitting it. In order to avoid that, you need a keepalive time of less than 10 minutes (say, 9 minutes). I'd suggest configuring this keepalive on the gRPC _server_.

You could also be hitting a default 30-second idle timeout. This value can be increased, but probably not above 10 minutes. Again, it would be best for the server to manage the keepalive here, as it controls the TCP proxy settings and can update them when things change.
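As a concrete illustration of that suggestion, here is a minimal sketch of server-side keepalive configuration for a grpc-java server using the Netty transport (the port and durations are illustrative, not any particular service's actual settings):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import java.util.concurrent.TimeUnit;

public class KeepAliveServer {
  public static void main(String[] args) throws Exception {
    Server server = NettyServerBuilder.forPort(8980)  // illustrative port
        // Send HTTP/2 PINGs from the server so otherwise-idle connections
        // (e.g. behind a 10-minute LB idle timeout) are kept alive.
        .keepAliveTime(9, TimeUnit.MINUTES)
        .keepAliveTimeout(20, TimeUnit.SECONDS)
        // Permit clients to send their own keepalive pings this frequently
        // without being sent GOAWAY (too_many_pings).
        .permitKeepAliveTime(1, TimeUnit.MINUTES)
        .permitKeepAliveWithoutCalls(true)
        // .addService(...) would register the actual cache/execution services.
        .build()
        .start();
    server.awaitTermination();
  }
}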

@ejona86 thanks for those details! Unfortunate to hear it does in fact work this way, but good to have confirmation.

@ulfjack :

AIUI, Bazel currently doesn’t set a timeout on execute calls.

Interesting! We should also get that fixed. Bazel should add a timeout to all outgoing RPCs. Retry logic here is a little tricky due to the Execute/WaitExecution dance and the need for supporting arbitrarily long actions, but I think the result should be something like:

  • Bazel sets a timeout on all Execute and WaitExecution calls. Probably just the value of the remote RPC timeout flag, like everything else, for now? But it could go reasonably low (few minutes) if we also:
  • Correct the retrier to not count DEADLINE_EXCEEDED on Execute or WaitExecution against the maximum retries (see the sketch below). That's an expected, normal case, meaning "please retry waiting". Right now it does, which requires raising max retries to support really long actions :(.
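A minimal sketch of that proposed retry policy (a hypothetical helper, not Bazel's actual RemoteRetrier): a DEADLINE_EXCEEDED on the wait call simply restarts the wait without consuming the retry budget, while other transient errors do consume it.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.function.Supplier;

public final class ExecuteRetryPolicy {
  // Waits for a remote execution to finish. Each attempt is expected to carry
  // its own deadline (e.g. a few minutes). DEADLINE_EXCEEDED means "nothing has
  // happened yet, keep waiting" and is retried without consuming the budget;
  // other retriable codes consume one retry each.
  public static <T> T waitWithRetries(Supplier<T> waitExecutionCall, int maxRetries) {
    int retriesLeft = maxRetries;
    while (true) {
      try {
        return waitExecutionCall.get();
      } catch (StatusRuntimeException e) {
        Status.Code code = e.getStatus().getCode();
        if (code == Status.Code.DEADLINE_EXCEEDED) {
          continue;  // Expected for long actions: re-issue the wait, budget untouched.
        }
        if (code == Status.Code.UNAVAILABLE && retriesLeft-- > 0) {
          continue;  // Transient connection trouble: retry, consuming budget.
        }
        throw e;  // Non-retriable, or budget exhausted.
      }
    }
  }

  private ExecuteRetryPolicy() {}
}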

We could send app-level keep-alives from the server but I don’t see how that’s better than client-side pings.

I'd definitely recommend it. Client-side pings happen at the connection level, but different HTTP2 proxies may also apply idle timeouts to the carried streams IIUC. For Execute requests RBE intentionally sends a no-op streaming response once per minute, which has proven sufficient to keep streams alive in all networking environments our clients run in.

I'm not fully understanding the scenario you're running into with dropped connections, but this may also suffice to avoid your issue, with server-side changes only. That should be quick and easy for you to land.

I can't say that I still remember the details of this, but here are the contents of the e-mail:

My e-mail

Hey,

I'd like to ask a clarifying question regarding the gRPC keepalive spec for the following scenario: I have an active channel with one or more open server-streaming calls that do not send any messages for hours. The gRPC keepalive time is set to 30s. All other knobs are left at their defaults.

My understanding of the spec is that in this scenario HTTP/2 PING frames will be sent at the following times after the channel was opened: 30s, 60s, 6 minutes, 11 minutes, ...

I am thinking this because GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA defaults to 2 and GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS defaults to 5 minutes.

Is this correct?

Response from the gRPC team

Hi,

After 2 pings without data, we will NOT send any more pings. So if there is no data, two pings will be sent 5 minutes apart after which no pings will be sent.

I am using a TCP load balancer rather than an HTTPS load balancer in this cluster because the HTTPS load balancer is significantly more expensive. We previously ran into the 30s idle timeout on the HTTPS load balancer but not the TCP load balancer. As far as I can tell, there is no way to set an idle timeout on an external TCP load balancer in GCP, and it's not clear from the GCP documentation whether there is an idle timeout in this case (it doesn't mention external TCP load balancers, and I also couldn't find a setting for this).

I'm also fairly sure that the connection was active at the time of the timeout, although I can't say this with complete certainty. If I'm right, then a server-side keep-alive ping isn't going to do diddly.

Why should this block merging PR #11957? The Java gRPC library doesn't have this bug, and it's an opt-in setting.

What's the purpose of setting a deadline on an execute call given that the client is just going to retry? This seems like a blatant abuse of functionality that is intended for something else. If we want to detect that the connection is dead, the right way to do that is to use keep-alives.

I am also fairly certain that internal TCP load balancers do not have an idle timeout given that we've run extensive tests with them and not seen any issues.

What's the purpose of setting a deadline on an execute call given that the client is just going to retry? This seems like a blatant abuse of functionality that is intended for something else. If we want to detect that the connection is dead, the right way to do that is to use keep-alives.

Independent problems - I agree that gRPC should detect dead connections (ideally without us needing to manually turn on keep-alives, especially if the gRPC teams can't agree on whether they're well-defined for this use-case). But adding a timeout to RPCs is good hygiene for many other reasons - it defends against the case of servers getting hung for whatever reason, plays nicer with proxies that don't like having to hold connections open for many hours, etc. In most service environments, it's best practice for 100% of RPCs to set timeouts, so anything without one is itself a smell. And since I believe we need to change bazel's behaviour around timeouts regardless, once we do there's no harm in also setting timeouts on execution RPCs.

Why should this block merging PR #11957? The Java gRPC library doesn't have this bug, and it's an opt-in setting.

Depends if it's a gRPC bug or not - sounds like the C folk don't think so. And so I'd personally only want to land that if either we can confirm with the gRPC folk that it's in line with the expected semantics of keepalives (i.e. the C implementation is incorrect), or put it under some 'experimental' guard if not (if it's possible the Java implementation is actually non-compliant, and this bazel feature won't work properly with any "proper" gRPC implementation).

Depends if it's a gRPC bug or not - sounds like the C folk don't think so.

They did think so, but we're all on the same page now and it is agreed it needs to change.

The internal email exchange was with @yashykt who has already worked on the design to fix it. @yashykt, did preventing GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA from impacting keepalive get on the Q3 OKRs, or do we expect it more in the Q4 time frame?

Ok cool - then I have no objection to landing the bazel change to be able to use keepalive.

They did think so, but we're all on the same page now and it is agreed it needs to change.

This is correct. Sorry for the trouble!

To summarize, currently, (as documented on https://github.com/grpc/grpc/blob/master/doc/keepalive.md) in gRPC Core, simply setting GRPC_ARG_KEEPALIVE_TIME_MS is not enough to enable keepalives properly. GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA and GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS are two additional channel args that need to be set to avoid bottlenecking keepalives.

Given the current state, please set GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA to 0, and GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS to a value same as GRPC_ARG_KEEPALIVE_TIME_MS.

@yashykt, did preventing GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA from impacting keepalive get on the Q3 OKRs, or do we expect it more in the Q4 time frame?

Yes, it is _planned_ for Q3.
Edit: The change for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA is planned for Q3. The change for GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA would need more discussion.

I implemented server-side pings, and they don't seem to help. Ran a few builds earlier today, and the second build got stuck for about an hour. My conclusion is that it is not an idle timeout.

This is a huge issue for us, any other suggestions for debugging?

Hey Keith, I am working on this issue but progressing slowly since I'm still new to the Bazel source code. Sorry for the inconvenience.

I have found a way to produce a hang by setting a breakpoint here (using Bazel 3.4.1), running a build with remote execution, waiting a few minutes after that's hit, then clearing and resuming...

The stack frames when it hangs are similar:

"skyframe-evaluator 282" #1660 prio=5 os_prio=0 cpu=1.67ms elapsed=935.91s tid=0x00007f33e802b800 nid=0x3f530 waiting on condition  [0x00007f337df0b000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x000000061a061fc0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/Unknown Source)
        at java.util.concurrent.LinkedBlockingQueue.take([email protected]/Unknown Source)
======> at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:642)
======> at io.grpc.stub.ClientCalls$BlockingResponseStream.waitForNext(ClientCalls.java:554)
======> at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:567)
======> at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.lambda$executeRemotely$0(GrpcRemoteExecutor.java:151)
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor$$Lambda$589/0x0000000800674840.call(Unknown Source)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:115)
        at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:134)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:329)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda$573/0x0000000800670c40.call(Unknown Source)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:115)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:309)
        at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:241)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:334)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:124)
        ...

"skyframe-evaluator 366" #460 prio=5 os_prio=0 cpu=3.60ms elapsed=737.91s tid=0x00007f38e0396000 nid=0x4a6b5 waiting on condition  [0x00007f3882995000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x0000000622b12c00> (a com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture)
        at java.util.concurrent.locks.LockSupport.park([email protected]/Unknown Source)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)
        at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
======> at com.google.devtools.build.lib.remote.util.Utils.getFromFuture(Utils.java:59)
======> at com.google.devtools.build.lib.remote.RemoteCache.waitForBulkTransfer(RemoteCache.java:219)
======> at com.google.devtools.build.lib.remote.RemoteExecutionCache.uploadMissing(RemoteExecutionCache.java:58)
        at com.google.devtools.build.lib.remote.RemoteExecutionCache.ensureInputsPresent(RemoteExecutionCache.java:110)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:319)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner$$Lambda$585/0x0000000800685040.call(Unknown Source)
        at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
        at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:115)
        at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:308)
        at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
        at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:100)
        at com.google.devtools.build.lib.actions.SpawnStrategy.beginExecution(SpawnStrategy.java:47)
        at com.google.devtools.build.lib.exec.SpawnStrategyResolver.beginExecution(SpawnStrategyResolver.java:65)
        at com.google.devtools.build.lib.analysis.actions.SpawnAction.beginExecution(SpawnAction.java:332)
        at com.google.devtools.build.lib.actions.Action.execute(Action.java:127)
        ...

My PR #11957 was merged. Can you try setting --grpc_keepalive_time and see if that helps? Caveat: the service needs to allow the keep-alive time, or it'll send a GOAWAY. The recommended best practice is to increase the keep-alive time on GOAWAY, but I couldn't find an easy way to implement that yet.

(The service is supposed to document what acceptable values for keepalive_time are.)

The first stack trace from @coeuvre looks exactly like what I've seen in my tests. I haven't seen anything like the second stack trace, but it could be the same root cause.

@ulfjack, the _grpc implementation_ should increase the keepalive time automatically on GOAWAY+too_many_pings. That should prevent outages but probably doesn't avoid the errors entirely. As an application author you can just notice the errors and respond to it out-of-band with code/flag changes.

I discovered a potentially interesting knock-on effect of this. In our case we end up killing this bazel process, and the next bazel invocation on the machine sometimes hangs waiting for the previous one to finish:

Another command (pid=73936) is running.  Waiting for it to complete on the server...

Yet when queried, that other pid doesn't exist. I assume this is unrelated to the root cause and is a side effect of poor cleanup.

There is an attempt in #12264 to upgrade gRPC to v1.32.x. Auto flow control is enabled by default in that version, which enables pinging. Maybe we can test with that.

I can reliably reproduce the issue. Happy to test when gRPC is upgraded.

Wow, would you please share how to reproduce?

I'm running a large build against our proprietary software cluster from home in a loop. So far, it always got stuck after a few iterations. I can give you access to our European cluster (London, UK), but I'm not sure that's going to be very helpful.

Thanks for sharing the setup! I will try to reproduce with RBE this way.

My setup is a test build with 2000 actions, where each action has a random text file with a size ranging from 1K to 4M as input and just copies the input to the output. To simulate bad network conditions, my incoming network is configured via netem with 10% packet drop and 300ms delay. However, no hangs occurred during hours of running this test build with RBE. Any suggestions?

I still suspect the GCP load balancer, and if you're talking to RBE, I think you may not go through the load balancer. If you have a second machine, you could try running the RemoteWorker on the second machine and pulling out the network cable at a moment when Bazel is waiting for actions to finish. If my theory is correct, that should repro it as well.

FYI after enabling --grpc_keepalive_time=20s with bazel 3.7.0 I still hit this issue:

        at com.google.devtools.build.lib.remote.GrpcCacheClient.lambda$downloadBlob$10(GrpcCacheClient.java:296)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:192)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.doFallback(AbstractCatchingFuture.java:179)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:124)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setFuture(AbstractFuture.java:800)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:203)
        at com.google.common.util.concurrent.AbstractCatchingFuture$AsyncCatchingFuture.setResult(AbstractCatchingFuture.java:179)
        at com.google.common.util.concurrent.AbstractCatchingFuture.run(AbstractCatchingFuture.java:133)
        at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
        at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1174)
        at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:969)
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:760)
        at com.google.common.util.concurrent.SettableFuture.setException(SettableFuture.java:53)
        at com.google.devtools.build.lib.remote.GrpcCacheClient$1.onError(GrpcCacheClient.java:344)
        at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at com.google.devtools.build.lib.remote.util.NetworkTime$NetworkTimeCall$1.onClose(NetworkTime.java:93)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:700)
        at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
        at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
        at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
        at io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:399)
        at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:521)
        at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:66)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:641)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$700(ClientCallImpl.java:529)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:703)
        at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:692)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
        ... 3 more
    Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
Received Goaway
load_shed
        at io.grpc.Status.asRuntimeException(Status.java:533)
        ... 24 more

[3,363 / 5,260] Compiling Swift module Yams; 91s remote-cache ... (19 actions, 18 running)
[3,525 / 5,348] Compiling Swift module Yams; 106s remote-cache ... (13 actions, 12 running)
[3,836 / 5,759] Compiling Swift module Yams; 124s remote-cache ... (13 actions, 12 running)
[4,664 / 6,893] Compiling Swift module Yams; 145s remote-cache ... (13 actions, 12 running)
[5,169 / 7,083] Compiling Swift module Yams; 168s remote-cache ... (13 actions running)
[5,341 / 7,083] Compiling Swift module Yams; 209s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 242s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 281s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 325s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 376s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 434s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 501s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 578s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 667s remote-cache
[5,341 / 7,083] Compiling Swift module Yams; 769s remote-cache

Thanks for your update, Keith.

The commits upgrading gRPC to v1.32.x are not released with bazel 3.7.0. Is there any chance that you can try with the master branch or the latest green build to verify whether upgrading gRPC helps?

If you have a second machine, you could try running the RemoteWorker on the second machine and pulling out the network cable at a moment when Bazel is waiting for actions to finish. If my theory is correct, that should repro it as well.

After some experiments, this doesn't reproduce the issue. Netty successfully detects that the network is unreachable / the channel is closed after pulling out the cable. Maybe we need a proxy server between them?

Auto flow control is enabled by default at that version which enables pinging.

Those pings should have no influence here. Those pings are only sent _after data is received_, so if there is a connectivity problem, they are highly unlikely to help.

Received Goaway
load_shed

The gRPC implementations don't produce that GOAWAY. So that makes me believe there is an L7 LB+proxy in play. Since --grpc_keepalive_time=20s didn't work, that probably means the connection between the LB+proxy and the backend is the one suffering.

Received Goaway
load_shed

The gRPC implementations don't produce that GOAWAY. So that makes me believe there is an L7 LB+proxy in play.

This is a loadshedding issue on the RBE side. https://github.com/bazelbuild/continuous-integration/issues/1055

Received Goaway
load_shed

The gRPC implementations don't produce that GOAWAY. So that makes me believe there is an L7 LB+proxy in play.

This is a loadshedding issue on the RBE side. bazelbuild/continuous-integration#1055

I'm not concerned with the load_shed error itself. Since client-side keepalive didn't resolve the hangs, it likely means the hanging connection is on the RBE side between the L7 proxy and the backend.

Anyone who's seeing load_shed responses from RBE, please comment on bazelbuild/continuous-integration#1055. I don't think they're related to the indefinite hang that this issue is about.

  • #12422 adds checks around NetworkTime to ensure we don't break gRPC.
  • #12426 adds checks around SettableFutures in the remote module to ensure they are always completed.

All these places can, in theory, throw an unchecked exception. By inserting code there that randomly throws RuntimeException, we can reproduce the hangs, and the stack traces are the same as what we get from users.

At this point, I believe the root cause of the hanging issue is that an unchecked exception was thrown during the RPC, so a Future never completes. It's hard to reproduce locally because the conditions are hard to meet.
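To make the failure mode concrete, here is a standalone sketch (hypothetical names; it deliberately hangs when run) of how an unchecked exception thrown before a SettableFuture is completed leaves the waiting caller blocked forever, which is the pattern the waitForBulkTransfer / getFromFuture stack traces show:

import com.google.common.util.concurrent.SettableFuture;

public class HangRepro {

  // Stand-in for the real callback body; here it always fails, simulating the
  // randomly injected RuntimeException used in the reproduction.
  static String downloadBlob() {
    throw new RuntimeException("injected failure");
  }

  public static void main(String[] args) throws Exception {
    SettableFuture<String> result = SettableFuture.create();

    new Thread(() -> {
      // Simulates a gRPC callback (e.g. onNext/onClose) for a cache download.
      // The unchecked exception unwinds the callback before result.set(...) or
      // result.setException(...) is called, so the future never completes.
      // The fix is to guard such paths and always call result.setException(e).
      result.set(downloadBlob());
    }).start();

    // The caller (the equivalent of Utils.getFromFuture in waitForBulkTransfer)
    // blocks here forever because nothing will ever complete the future.
    System.out.println(result.get());
  }
}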

In theory, the issue is fixed by these PRs. Can you help verify?

I built a Bazel binary from head including both #12422 and #12426, and I still see hangs.

Where were the hangs? If it is stuck at remote execution, can you try with --experimental_remote_execution_keepalive?

Maybe unrelated, but I see this error when I enable --experimental_remote_execution_keepalive:

ERROR: /home/ulfjack/Google/tensorflow/tensorflow/core/util/BUILD:364:24: Action tensorflow/core/util/version_info.cc failed: Exec failed due to IOException: /dev/null.remote (Permission denied)

I have had it running for 4 hours without any major hangs.

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

[8,281 / 11,167] 75 actions, 48 running
    Compiling tensorflow/core/kernels/training_ops.cc; 2945s remote
    Compiling tensorflow/core/kernels/list_kernels.cc; 2885s remote
    Compiling tensorflow/lite/toco/tflite/operator.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/op_version.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/import.cc; 2860s remote
    Compiling tensorflow/lite/toco/tflite/export.cc; 2860s remote
    Compiling .../compiler/mlir/tensorflow/ir/tf_ops_n_z.cc; 2860s remote
    Compiling .../lite/toco/logging/conversion_log_util.cc; 2860s remote ...

Well, I spoke too soon. It hung itself up just after my previous reply. Output looks like this:

Oh no 😱

What does the stack trace say and is there anything interesting in the jvm.out? I think Chi added some more logging that might be useful for analysis.

jvm.out is empty, but I didn't capture a stack trace. Will rerun and see if I can get one.

Note that I was able to reproduce a hang somewhat regularly with illegal GO_AWAY behavior on the server side. I haven't tried a build with Chi's fixes yet, but if you're working with a server that might send GO_AWAY frames that's a possibility.

The cluster I'm currently working with is not configured to send GO_AWAY frames.

bazel-hanging-threads.txt

Attached a stack trace. It's possible I removed jvm.out before I looked at it.

I ran Bazel in a loop since last night, and it got stuck 4 times.

Nope, no output in jvm.out.

Nope, no output in jvm.out.

output from server is stored at $output_base/java.log.

bazel-hanging-threads.txt

Attached a stack trace.

This stack trace shows that Bazel was stuck at remote execution calls (without --experimental_remote_execution_keepalive). This is different from the hangs reported by Keith and other users - stack traces showed that Bazel was stuck at remote cache calls. Both #12422 and #12426 are used to fix hangs at remote cache calls.

Hangs at remote execution calls are explained by this design doc and should be fixed with --experimental_remote_execution_keepalive. Does the error from https://github.com/bazelbuild/bazel/issues/11782#issuecomment-726080362 stop you from using this flag? If so, could you please share more details (e.g. jvm.log, java.log) so I can look into it?

Hmm, ok. The stacktrace is from stuck builds with --experimental_remote_execution_keepalive. I think this is a log file from one of those runs.

java.log.txt

Interesting. Remote execution with --experimental_remote_execution_keepalive should use ExperimentalGrpcRemoteExecutor, but the stack traces showed that Bazel is stuck in GrpcRemoteExecutor, which is not created when the flag is enabled. Would you please double-check that the hangs are from a build with --experimental_remote_execution_keepalive?

Ok, new stack trace.

bazel-hangs-stacks.txt

And a matching log file.

java.log.txt

Can't find any hints in java.log.

Since both GrpcRemoteExecutor and ExperimentalGrpcRemoteExecutor use blocking gRPC calls, we don't have the risk we had for remote cache calls, where a future that never completes results in a hang (unless there is a bug inside gRPC). The new stack trace shows that Bazel is waiting for network data. What are the values of --remote_timeout/--remote_retries here? Is it possible we didn't reach the maximum time (--remote_timeout x (--remote_retries + 1))?

Closing as the issue seems to be fixed. Feel free to re-open if it happens again.
