Bazel occasionally gets stuck during execution when using multiplexed workers
This issue is mentioned in the Multiplexed Workers doc. I'm filing this bug to open it up to the community, since we have not been successful in tracking it down so far. If anyone trying multiplexed workers runs into it, it would be valuable to hear from you.
We have not been able to reproduce this bug consistently. It seems to be due to a race condition.
Ubuntu 18.04
bazel info release? We're using a custom release (off of Bazel 0.29); however, the symptoms have also been observed by others running different versions of Bazel.
Similar hanging behavior was observed in one of the multiplexed worker tests.
The following snippets from thread dumps of the Bazel server and the worker process show that each is stuck waiting for a message from the other (a sketch of the worker's read loop follows the dumps):
Bazel server:
"Thread-2": running, holding [0x000000072def8500]
at java.io.FileInputStream.readBytes(java.base/Native Method)
at java.io.FileInputStream.read(java.base/Unknown Source)
at java.io.BufferedInputStream.fill(java.base/Unknown Source)
at java.io.BufferedInputStream.read(java.base/Unknown Source)
at java.io.FilterInputStream.read(java.base/Unknown Source)
at com.google.devtools.build.lib.worker.RecordingInputStream.read(RecordingInputStream.java:56)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:275)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:280)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:347)
at com.google.devtools.build.lib.worker.WorkerProtocol$WorkResponse.parseDelimitedFrom(WorkerProtocol.java:2279)
at com.google.devtools.build.lib.worker.WorkerMultiplexer.waitResponse(WorkerMultiplexer.java:185)
at com.google.devtools.build.lib.worker.WorkerMultiplexer.run(WorkerMultiplexer.java:210)
Worker process:
"main": running, holding [0x0000000704878b68]
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:246)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:267)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:272)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:48)
at com.google.protobuf.GeneratedMessageV3.parseDelimitedWithIOException(GeneratedMessageV3.java:368)
at com.google.devtools.build.lib.worker.WorkerProtocol$WorkRequest.parseDelimitedFrom(WorkerProtocol.java:1164)
at higherkindness.rules_scala.common.worker.WorkerMain.process$1(WorkerMain.scala:53)
at higherkindness.rules_scala.common.worker.WorkerMain.main(WorkerMain.scala:101)
at higherkindness.rules_scala.common.worker.WorkerMain.main$(WorkerMain.scala:27)
at higherkindness.rules_scala.workers.zinc.compile.ZincRunner$.main(ZincRunner.scala:55)
at higherkindness.rules_scala.workers.zinc.compile.ZincRunner.main(ZincRunner.scala)
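For context, a persistent worker reads length-delimited WorkRequest protos from stdin and writes WorkResponse protos back on stdout. The following is a minimal Scala sketch of that read loop, assuming the generated WorkerProtocol classes from the stack traces are on the classpath. It handles requests inline rather than dispatching each one to its own thread the way a real multiplexed worker would; the point is only to show the blocking parseDelimitedFrom call where the worker's "main" thread above is parked, while the Bazel server's multiplexer thread is symmetrically parked in WorkResponse.parseDelimitedFrom.

```scala
// Minimal sketch of a worker's request loop, not the rules_scala implementation.
import com.google.devtools.build.lib.worker.WorkerProtocol.{WorkRequest, WorkResponse}

object WorkerLoopSketch {
  def main(args: Array[String]): Unit = {
    // parseDelimitedFrom blocks until the Bazel server writes another
    // length-delimited WorkRequest to the worker's stdin (null at EOF).
    var request = WorkRequest.parseDelimitedFrom(System.in)
    while (request != null) {
      // ... do the actual work for `request` here (elided) ...
      WorkResponse.newBuilder()
        .setRequestId(request.getRequestId)
        .setExitCode(0)
        .build()
        .writeDelimitedTo(System.out)
      System.out.flush()
      request = WorkRequest.parseDelimitedFrom(System.in)
    }
  }
}
```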
@philwo are you able to chime in here?
@borkaehw do you have any information on this?
@susinmotion I worked with @SrodriguezO on this; it's an issue we are facing together. Unfortunately, I don't have more context to provide. We tried to solve it without luck, so we are reaching out to the community for help. Do you see the same problem?
Ah, I see. We aren't using this feature yet, so we haven't been encountering this problem, but our tests do show it. I'll keep this issue open for our team (or anyone else who has ideas!), but it may be some time before we have a chance to investigate.
Any update here? We've just turned it on for the Kotlin rules and will monitor our builds. What's the sense of how frequently it occurs, and does anyone have any leads?
We've been using multiplex workers with Scala for the performance and especially memory benefits for a while now. We do run into the deadlock somewhat frequently in dev and very rarely in CI. This might just be a numbers game though (we have a lot of developers running lots of builds every day).
As far as leads go, we suspect it occurs when the worker throws an unhandled error and the multiplexer fails to handle the response adequately. We haven't had the bandwidth to dig further into this, though.
We figured out what was causing this issue for us. It was a problem with our multiplex worker implementation: the issue occurred whenever the worker hit a fatal exception, in which case the worker would neither report the failure nor exit. More details here
Since Future only catches NonFatal exceptions, any fatal exception would leave the Future never completing, so a WorkResponse was never sent back to the Bazel server, which was then stuck waiting forever for a response.
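To illustrate the failure mode, here is a minimal Scala sketch of defensive per-request handling, not the actual rules_scala change: the compile and writeResponse helpers are hypothetical stand-ins, and the only point is that a fatal throwable, which Future's machinery will not turn into a Failure, must be trapped explicitly so the Bazel server either gets a WorkResponse or sees the worker die instead of waiting forever.

```scala
// A sketch only. `compile` and `writeResponse` are hypothetical; the
// WorkerProtocol classes are the real protobuf-generated ones from the dumps.
import com.google.devtools.build.lib.worker.WorkerProtocol.{WorkRequest, WorkResponse}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal
import scala.util.{Failure, Success}

object SafeRequestHandling {
  // Hypothetical per-request work; returns an exit code.
  def compile(request: WorkRequest): Int = 0

  // Multiplexed responses share stdout, so writes are serialized.
  def writeResponse(requestId: Int, exitCode: Int): Unit = synchronized {
    WorkResponse.newBuilder()
      .setRequestId(requestId)
      .setExitCode(exitCode)
      .build()
      .writeDelimitedTo(System.out)
    System.out.flush()
  }

  def handle(request: WorkRequest)(implicit ec: ExecutionContext): Unit =
    Future {
      try compile(request)
      catch {
        // Future only traps NonFatal throwables; anything fatal would escape,
        // the Future would never complete, and the server would hang (the bug
        // described above). So report a failure and terminate the worker.
        case fatal: Throwable if !NonFatal(fatal) =>
          writeResponse(request.getRequestId, exitCode = 1)
          sys.exit(1)
      }
    }.onComplete {
      case Success(exitCode) => writeResponse(request.getRequestId, exitCode)
      case Failure(_)        => writeResponse(request.getRequestId, exitCode = 1)
    }
}
```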
The change I linked seems to have resolved the issue entirely for us. It seems @tomdegoede encountered a different deadlock on Bazel's end though, so I'll leave this issue open for now.
Hooray! @larsrc-google
Glad you found that! I'm looking into @tomdegoede's fix, but we don't have a reliable repro yet.
-Lars
As mentioned on PR #12219, the full solution is complicated. For any of you still seeing these issues: do they happen after a Ctrl-C or a failed build, or also after regular successful builds? If it's the former, better interrupt handling should help.
If anyone still sees this problem after the change I made above, please let me know.