Bazel: Multiplex persistent worker protocol

Created on 16 Apr 2017 · 32 comments · Source: bazelbuild/bazel

Background

Bazel spawns 4 persistent worker processes and sends each of them requests serially via stdin/stdout.

Requirements

Option to spawn 1 process instead of 4 in ctx.action execution_requirements.
Ability to handle multiple requests simultaneously

Justification

JVM has high memory overhead.
CacheBuilder and SoftReference turn Java GC into super fast LRU cache for ASTs.
Why have 4 caches?
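
To make the "4 caches" point concrete, here is a rough sketch rather than a measurement. It assumes a plain LRU cache as a stand-in for the CacheBuilder/SoftReference setup, round-robin dispatch across workers, and a made-up workload in which each file is needed by 4 consecutive actions. The names LruCache and total_misses are hypothetical.

```python
from collections import OrderedDict


class LruCache:
    """Tiny LRU cache that counts how often it had to recompute a value."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, key, compute):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = compute(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return value


def total_misses(worker_count, requests, capacity=16):
    """Dispatch requests round-robin (an assumption) and sum cache misses."""
    caches = [LruCache(capacity) for _ in range(worker_count)]
    for index, key in enumerate(requests):
        caches[index % worker_count].get(key, compute=lambda k: f"ast({k})")
    return sum(cache.misses for cache in caches)


# Hypothetical workload: 8 source files, each needed by 4 consecutive actions.
requests = [f"file{n}" for n in range(8) for _ in range(4)]
separate = total_misses(4, requests)  # four workers, four private caches
shared = total_misses(1, requests)    # one worker, one shared cache
print(separate, shared)
```

In this deliberately unfavorable workload every private cache has to parse every file once, so four caches do four times the work of one shared cache. Real hit rates depend on how Bazel schedules actions.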

Design No. 1: Multiplex

Continue using stdin/stdout
Add request_id field to worker protocol request and response message protos
Add is_trying boolean field to response proto, sort of like 100 Trying in SIP
Bazel can send another request in parallel if it gets an is_trying response
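
A minimal sketch of how Design No. 1 could behave from the worker's side, not Bazel's actual implementation. It assumes newline-delimited JSON in place of the protobuf messages, and the function names serve, collect and handler, as well as the arguments and output fields, are hypothetical. Only request_id and is_trying come from the proposal.

```python
import io
import json
import threading
from concurrent.futures import ThreadPoolExecutor


def serve(requests, out, handler, max_workers=4):
    """Read one JSON request per line and answer them concurrently.

    Each request is acknowledged immediately with is_trying=True so the
    sender knows it may send another one. The final response repeats the
    request_id, so responses can be written in any order.
    """
    write_lock = threading.Lock()  # only one thread writes to the pipe at a time

    def send(message):
        with write_lock:
            out.write(json.dumps(message) + "\n")
            out.flush()

    def work(request):
        result = handler(request["arguments"])
        send({"request_id": request["request_id"], "is_trying": False, "output": result})

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for line in requests:
            request = json.loads(line)
            send({"request_id": request["request_id"], "is_trying": True})
            pool.submit(work, request)
    # Leaving the with-block waits for all submitted work to finish.


def collect(lines):
    """Sender side: match final responses to requests by request_id."""
    results = {}
    for line in lines:
        message = json.loads(line)
        if not message["is_trying"]:
            results[message["request_id"]] = message["output"]
    return results


# In-memory streams stand in for the worker's stdin and stdout.
requests = io.StringIO(
    "".join(json.dumps({"request_id": i, "arguments": [f"src{i}.java"]}) + "\n" for i in range(3))
)
responses = io.StringIO()
serve(requests, responses, handler=lambda args: "compiled " + args[0])
results = collect(responses.getvalue().splitlines())
print(results)
```

The write lock matters because several threads share one output stream; without it two responses could interleave and corrupt the framing.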

Design No. 2: TCP

Add upgrade field to response proto that redirects Bazel to a HostAndPort.
All future requests get sent there via TCP
Don't use gRPC; just send the raw protos
Maybe allow multiple requests per socket
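
If several requests share one socket, raw messages still need some framing so the receiver can tell where one ends and the next begins. The sketch below is one possible approach under stated assumptions: a 4-byte big-endian length prefix, JSON bytes standing in for serialized protos, and socket.socketpair() standing in for a real TCP connection. The helper names and the arguments and exit_code fields are hypothetical.

```python
import json
import socket
import struct


def send_frame(sock, message):
    """Write one message as a 4-byte length prefix followed by the payload."""
    payload = json.dumps(message).encode("utf-8")
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, size):
    """Read exactly `size` bytes; recv() may return fewer than requested."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the socket mid-frame")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock):
    """Read one length-prefixed message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


# A connected socket pair stands in for the Bazel-to-worker connection.
bazel_side, worker_side = socket.socketpair()

# Two requests are in flight on the same socket at once.
send_frame(bazel_side, {"request_id": 1, "arguments": ["a.scala"]})
send_frame(bazel_side, {"request_id": 2, "arguments": ["b.scala"]})
first = recv_frame(worker_side)
second = recv_frame(worker_side)

# The worker may finish them in any order; request_id ties each reply to its request.
send_frame(worker_side, {"request_id": second["request_id"], "exit_code": 0})
send_frame(worker_side, {"request_id": first["request_id"], "exit_code": 0})
replies = [recv_frame(bazel_side), recv_frame(bazel_side)]

bazel_side.close()
worker_side.close()
print([reply["request_id"] for reply in replies])
```

This is roughly the amount of machinery that has to be rebuilt by hand once gRPC is ruled out: framing plus an id to correlate replies.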

CC: @lberki, @meistert with whom I socialized idea offline IIRC

P2 sandboxing feature request

Source

jart

👍8

All 32 comments

> Don't use gRPC just send the raw protos
> Maybe allow multiple requests per socket

Those two seem contradictory. That is, if you want multiple requests per socket, you again need to implement something like gRPC.

buchgr on 18 Apr 2017

Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support so I'd rather it pays for its rent.

lberki on 18 Apr 2017

👍1

Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

abergmeier-dsfishlabs on 18 Apr 2017

Fun fact, a multiplexing TCP-based version of this was implemented in the very first version of persistent workers and worked perfectly fine with a multi-threaded version of the JavaBuilder worker - but was then deemed unnecessary complexity by me and teammates and the code was deleted (AFAIR without even submitting the CL, so we can't restore it from history, ouch) and replaced with the simpler, serialized, multi-process stdin/stdout mechanism. :| Maybe we should have gone with the more complex version in the first place. Hindsight is best sight.

I'll have a look at this! Thanks for writing this proposal down so cleanly.

philwo on 18 Apr 2017

@philwo My pleasure. Did you consider using the multiplexing technique described in Design No. 1? That would avoid the socket complexity and should hopefully be pretty straightforward. The user could continue doing things the simple way if he wants

@lberki It's far too easy to accidentally link the wrong thing in our internal repo and end up with so many jars that the JVM takes up gigs of memory. The JVM is amazing at threads and garbage collection so it makes sense to me to utilize those strengths, just like Bazel does.

@abergmeier-dsfishlabs The requests would still get sent to the frontend in parallel. Maybe the four frontends could scheme together to launch a single backend, possibly by locking a single input file, but I'm not sure if Bazel would consider that hermetic.

jart on 18 Apr 2017

> Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

Because that would be really easy to mess up. Who would start it? Who would stop it? Who would check to see if files have changed?

> Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support so I'd rather it pays for its rent.

Lucid Software has seen a massive memory regression in transitioning from sbt (Scala) to Bazel. sbt used a single JVM process, and so it could JIT the Scala compiler once and have reasonable memory overhead. Then Bazel says, "Hey, if you want the same performance you had before, take your machine apart, cram a bunch more RAM into it, and start 8x the number of processes, each doing the exact same JIT."

As @jart said, it's insane to have multiple local caches. And the only reason to use workers is to cache things, no? (Typically caching JIT, sometimes just caching loading the executable, and I suppose you can get fancier.) Is there any situation that wouldn't use significantly less memory with this proposal?

pauldraper on 4 Sep 2018

> Is there any situation that wouldn't use significantly less memory with this proposal?

For compilers that don't support multi-threaded compilation it should be neither a win nor a loss. I remember talking to @philwo offline and I believe we agreed that it's a good idea to support your use case. Would you be interested in working on this, @pauldraper?

buchgr on 4 Sep 2018

> For compilers that don't support multi-threaded compilation

That's true. Say, Node.js-based compilers.

> Would you be interested in working on this @pauldraper ?

I'm no longer working with this, but @jjudd may be interested.

pauldraper on 4 Sep 2018

My 2c though: I suppose TCP would help the #4897 issues. But I like the simplicity of stdin/stdout (even when multiplexed). And not fiddling with Nagle, etc.

FWIW, the apt transport protocol is multiplexed stdin/stdout with an executable.

pauldraper on 4 Sep 2018

We are definitely interested in this. Launching lots of JVMs consumes lots of resources.

I'm not sure when we will have time to work on it, but it is something we are interested in.

@buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

jjudd on 4 Sep 2018

> @buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

As the original author of the workers feature @philwo should be able to answer this question best and provide guidance.

buchgr on 4 Sep 2018

/sub

jin on 4 Sep 2018

@philwo friendly ping. In your opinion how large of a task is this? Hours, days, weeks, months?

jjudd on 18 Sep 2018

I think it shouldn't take long - days for a first prototype, weeks for a first complete version maybe? This would only concern the Bazel side though, I can't speak about updating existing workers to take advantage of the new protocol.

All the worker related code in Bazel is concentrated here: https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/worker/ - so you don't need much context about how Bazel works.

There's an integration test, too: https://source.bazel.build/bazel/+/master:src/test/shell/integration/bazel_worker_test.sh

Regarding the protocol, I'm open to whatever you'd come up with that works well and is easily integrated into various languages out there. I think I've seen persistent workers written in Java, JavaScript, TypeScript, Dart so far.

@cushon (Java), @mprobst (TypeScript) and @davidmorgan (Dart) might want to comment on this with their ideas / wishes. :)

philwo on 27 Sep 2018

From the Dart side: parallel requests in one worker aren't super exciting since Dart is single threaded. We're planning on experimenting with build performance in Q4; we might have some suggestions for worker protocol changes of our own. (Not super high probability, though; 20% maybe.)

davidmorgan on 27 Sep 2018

Same on the TypeScript side, our workers are necessarily single threaded, so this wouldn't help us.

mprobst on 27 Sep 2018

@kevin1e100 might be interested in this for kotlin.

Javac is single-threaded, but I think this would allow us to run multiple instances of it in one worker and share a cache and memory footprint, instead of having multiple workers, which starts to use a lot of memory and means any cache takes a long time to warm up. How do Dart and TypeScript avoid those issues with the current approach?

cushon on 27 Sep 2018

Right, even with single-threaded underlying tools, if they can safely be run in parallel, that can still be a win, I would think. But from my point of view this really shines when the worker wants to do some kind of caching (example below) or incremental scheme (e.g., Java compilation is typically incremental in the Eclipse IDE IIUC). Bazel's DexBuilder worker for Android builds, for instance, uses caching, but as @cushon mentioned, all worker instances have their own cache, which can be unfortunate.

kevin1e100 on 27 Sep 2018

Thanks for the estimate @philwo. We are starting work on this. @borkaehw is leading the implementation on our end. We'll keep people updated as we make progress, propose designs, etc.

jjudd on 28 Sep 2018

@jjudd it would be great if you could share a design document with bazel-discuss / bazel-dev before doing the implementation. We are happy to review it and give pointers! Thanks so much!

buchgr on 28 Sep 2018

@cushon users of bazel+workers+Dart are google internal--we just use a lot of RAM.

davidmorgan on 28 Sep 2018

Cc @johnynek since we (rules_scala) also use a persistent worker and I think we'd love this feature as well.

ittaiz on 28 Sep 2018

@davidmorgan are you using the workers for caching/incrementality, or mostly to keep a VM warm? If you're using caching, have you seen issues with the hit rate from having a separate cache for each worker instance?

cushon on 28 Sep 2018

@cushon Right now mostly to keep a VM warm. We hope to gain more from caching/incrementality in future.

davidmorgan on 29 Sep 2018

We have a design doc from Lucid Software; it basically follows Design No. 1: Multiplex. We would like to gather feedback and open a PR after we address everyone's concerns.

https://docs.google.com/document/d/1OC0cVj1Y1QYo6n-wlK6mIwwG7xS2BJX7fuvf1r5LtnU/edit?usp=sharing

borkaehw on 26 Oct 2018

@borkaehw thank you for the doc! could you please read the Bazel Design Review process and follow the steps there? This will improve the visibility of the doc to the entire Bazel ecosystem (bazel-dev, proposals repository).

cc @laurentlb

jin on 26 Oct 2018

@jin thanks, I would love to.

borkaehw on 26 Oct 2018

@jin just want to make sure I understand the steps.

Create a PR in bazelbuild/proposals adding a row under "Under review".
Send a greeting to the bazel-dev mailing list with the design doc and discussion thread (this thread).

I am not sure we want to open a PR against bazelbuild/bazel yet, since some features still need to be decided and nothing is implemented.

Am I missing anything?

borkaehw on 27 Oct 2018

Yes, please hold off creating a Bazel PR until your design has been approved. The main discussion thread should be the announcement in the bazel-dev mailing list, not this issue, but you can link to this issue there.

jin on 27 Oct 2018

@jin just completed the requirements. Let me know if further actions are needed. Thanks.

borkaehw on 29 Oct 2018

👍1

Link to the mailing list thread for those interested: https://groups.google.com/forum/#!topic/bazel-dev/bLoQsFpd2bA

jjudd on 29 Oct 2018

Just realized that this was never linked here. A prototype implementation that we've been using internally for 2-3 weeks now is in review here: #6857

jjudd on 19 Dec 2018

👍1
