Bazel: Bazel occasionally stuck during execution when using multiplexed workers

Created on 21 Nov 2019 · 11 comments · Source: bazelbuild/bazel

Description of the problem:

Bazel occasionally gets stuck during execution when using multiplexed workers

This issue is mentioned in the Multiplexed Workers doc. I'm filing this bug to open it up to the community, since we have not been successful in tracking it down so far. If anyone trying multiplexed workers runs into it, we would value hearing from you.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

We have not been able to reproduce this bug consistently. It seems to be due to a race condition.

What operating system are you running Bazel on?

Ubuntu 18.04

What's the output of bazel info release?

We're using a custom release (off of Bazel 0.29); however, the symptoms have also been observed by others running different versions of Bazel.

Have you found anything relevant by searching the web?

Similar hanging behavior was observed in one of the multiplexed worker tests.

Any other information, logs, or outputs that you want to share?

The following snippets from thread dumps of the Bazel server and of the worker process show that each is stuck waiting for a message from the other.

Bazel server:

"Thread-2": running, holding [0x000000072def8500]
    at java.io.FileInputStream.readBytes(java.base@.../Native Method)
    at java.io.FileInputStream.read(java.base@.../Unknown Source)
    at java.io.BufferedInputStream.fill(java.base@.../Unknown Source)
    at java.io.BufferedInputStream.read(java.base@.../Unknown Source)
    at java.io.FilterInputStream.read(java.base@.../Unknown Source)
    at com.google.devtools.build.lib.worker.RecordingInputStream.read(RecordingInputStream.java:56)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:275)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:280)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:347)
    at com.google.devtools.build.lib.worker.WorkerProtocol$WorkResponse.parseDelimitedFrom(WorkerProtocol.java:2279)
    at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:185)
    at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:210)

Worker process:

"main": running, holding [0x0000000704878b68]
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(FileInputStream.java:255)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:246)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:267)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:272)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:48)
    at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:368)
    at com.google.devtools.build.lib.worker.WorkerProtocol$WorkRequest.parseDelimitedFrom(WorkerProtocol.java:1164)
    at higherkindness.rules_scala.common.worker.WorkerMain.process$1(WorkerMain.scala:53)
    at higherkindness.rules_scala.common.worker.WorkerMain.main(WorkerMain.scala:101)
    at higherkindness.rules_scala.common.worker.WorkerMain.main$(WorkerMain.scala:27)
    at higherkindness.rules_scala.workers.zinc.compile.ZincRunner$.main(ZincRunner.scala:55)
    at higherkindness.rules_scala.workers.zinc.compile.ZincRunner.main(ZincRunner.scala)
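
To make the stuck state concrete, here is a minimal sketch of a multiplexed worker's request/response loop. This is illustration code under stated assumptions, not the rules_scala or Bazel implementation; doWork is a hypothetical stand-in for the real compile step. Both ends exchange length-delimited protobufs over the worker's stdin/stdout: Bazel blocks in WorkResponse.parseDelimitedFrom on the worker's stdout while the worker blocks in WorkRequest.parseDelimitedFrom on stdin, which is exactly where the two dumps above are stuck.

```scala
// Hypothetical minimal multiplexed-worker loop (illustration only, not the real implementation).
import java.io.{InputStream, PrintStream}
import scala.concurrent.{ExecutionContext, Future}
import com.google.devtools.build.lib.worker.WorkerProtocol.{WorkRequest, WorkResponse}

object MultiplexWorkerLoopSketch {
  def run(stdin: InputStream, stdout: PrintStream)(implicit ec: ExecutionContext): Unit = {
    while (true) {
      // Blocks until Bazel writes the next length-delimited WorkRequest; null means EOF.
      val request = WorkRequest.parseDelimitedFrom(stdin)
      if (request == null) return

      // Handle each request concurrently so the loop can keep reading new requests.
      Future {
        val exitCode = doWork(request) // hypothetical stand-in for the real work
        val response = WorkResponse.newBuilder()
          .setRequestId(request.getRequestId)
          .setExitCode(exitCode)
          .build()
        // Serialize writes: interleaved responses would corrupt the stream.
        stdout.synchronized {
          response.writeDelimitedTo(stdout)
          stdout.flush()
        }
      }
    }
  }

  private def doWork(request: WorkRequest): Int = 0 // placeholder
}
```

If the body of that Future dies without either writing a response or killing the process, the server side of this exchange hangs, which is the failure mode discussed in the comments below.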

Important files:

  • The multiplexer is implemented in bazelbuild/bazel: WorkerMultiplexer.java
  • The worker used by the symptomatic multiplexed worker test mentioned earlier is also implemented in bazelbuild/bazel: ExampleWorkerMultiplexer.java (note: this is a multiplexer-compatible worker, not a multiplexer)
  • The worker we use (and which yielded the worker process thread dump snippet above) is implemented in a multiplexer-compatible branch of higherkindness/rules_scala: ZincRunner.scala, WorkerMain.scala
Labels: P2, team-Local-Exec, bug

All 11 comments

@philwo are you able to chime in here?

@borkaehw do you have any information on this?

@susinmotion I worked with @SrodriguezO on this, and it's an issue we are facing together. Unfortunately I don't have more context to provide. We tried to solve it with no luck, so we are reaching out to the community for help. Do you see the same problem?

Ah, I see. We aren't using this feature yet, so we haven't encountered this problem ourselves, but our tests do show it. I'll keep this issue open for our team (or anyone else who has ideas!), but it may be some time before we have a chance to investigate.

Any update here? We've just turned it on for the Kotlin rules and will be monitoring our builds. How frequently does it occur, and has anyone got any leads?

We've been using multiplex workers with Scala for the performance and especially memory benefits for a while now. We do run into the deadlock somewhat frequently in dev and very rarely in CI. This might just be a numbers game though (we have a lot of developers running lots of builds every day).

As far as leads go, we suspect it occurs when the worker throws an unhandled error and the multiplexer fails to handle the response adequately. We have not had the bandwidth to dig further into this, though.

We figured out what was causing this issue for us: it was a problem with our multiplexed worker implementation. The issue occurred whenever the worker encountered a fatal exception, in which case the worker would neither report the failure nor exit. More details here.

Since Future only catches NonFatal exceptions, a fatal exception would cause the Future to never complete. That meant a WorkResponse was never sent back to the Bazel server, which would then be stuck waiting forever for a response.
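
For illustration, here is a hedged sketch of that failure mode and one defensive fix, assuming hypothetical helpers handleRequest and sendResponse rather than the actual rules_scala code. In Scala, Future.apply only captures throwables matched by NonFatal, so a fatal error such as an OutOfMemoryError escapes the body, the Future never completes, and onComplete callbacks never run.

```scala
// Sketch of the hang described above; handleRequest and sendResponse are hypothetical helpers.
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal
import scala.util.{Failure, Success}
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkRequest

object FatalErrorSketch {
  def handleRequest(request: WorkRequest): Int = ???                         // hypothetical: do the work
  def sendResponse(requestId: Int, exitCode: Int, output: String): Unit = ??? // hypothetical: write a WorkResponse

  // The hanging shape: a fatal error thrown by handleRequest is not captured by
  // Future.apply, so the Future never completes and onComplete never fires;
  // no WorkResponse is written and Bazel blocks forever waiting for one.
  def processHanging(request: WorkRequest)(implicit ec: ExecutionContext): Unit =
    Future(handleRequest(request)).onComplete {
      case Success(code) => sendResponse(request.getRequestId, code, "")
      case Failure(e)    => sendResponse(request.getRequestId, 1, e.toString)
    }

  // A defensive shape: catch every Throwable inside the body so a best-effort
  // response is always attempted, then exit on fatal errors so Bazel observes
  // the worker dying instead of waiting on a response that will never come.
  def processDefensive(request: WorkRequest)(implicit ec: ExecutionContext): Unit =
    Future {
      try {
        val code = handleRequest(request)
        sendResponse(request.getRequestId, code, "")
      } catch {
        case t: Throwable =>
          sendResponse(request.getRequestId, 1, t.toString)
          if (!NonFatal(t)) sys.exit(1)
      }
    }
}
```

The exact fix is a judgment call; the essential property is that for every request the Bazel server either receives a WorkResponse or sees the worker process exit.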

The change I linked seems to have resolved the issue entirely for us. It seems @tomdegoede encountered a different deadlock on Bazel's end though, so I'll leave this issue open for now.

Hooray! @larsrc-google

Glad you found that! I'm looking into @tomdegoede's fix, but we don't have a reliable repro yet.

-Lars

As mentioned on PR #12219, the full solution is complicated. Do those of you still hitting this see it after a Ctrl-C or a failed build, or also after regular, successful builds? If it's the former, better interrupt handling should help.

If anyone still sees this problem after the change I made above, please let me know.
