Bazel: Multiplex persistent worker protocol

Created on 16 Apr 2017 · 32 Comments · Source: bazelbuild/bazel

Background

Bazel spawns 4 persistent worker processes and sends them requests serially via stdin/stdout.

Requirements

  • Option to spawn 1 process instead of 4 in ctx.action execution_requirements.
  • Ability to handle multiple requests simultaneously

Justification

  • JVM has high memory overhead.
  • CacheBuilder and SoftReference turn Java GC into super fast LRU cache for ASTs.
  • Why have 4 caches?

Design No. 1: Multiplex

  • Continue using stdin/stdout
  • Add request_id field to worker protocol request and response message protos
  • Add is_trying boolean field to response proto, sort of like 100 Trying in SIP
  • Bazel can send another request in parallel if it gets an is_trying response
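A minimal sketch of what Design No. 1 could look like from the worker's side. It assumes newline-delimited JSON as a stand-in for the worker protocol protos; only request_id comes from the proposal, while the `arguments` and `output` fields, the `serve` function, and the `handle` callback are hypothetical names for illustration. The is_trying acknowledgement is left out to keep the example short.

```python
import io
import json
import threading
from concurrent.futures import ThreadPoolExecutor


def serve(requests, responses, handle, max_parallel=4):
    """Read one JSON request per line and answer each as soon as it finishes.

    Responses may be written out of order; the caller pairs them with
    requests through request_id.
    """
    write_lock = threading.Lock()  # keeps concurrent writes from interleaving

    def run(request):
        reply = {"request_id": request["request_id"],
                 "output": handle(request["arguments"])}
        with write_lock:
            responses.write(json.dumps(reply) + "\n")
            responses.flush()

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        for line in requests:
            if line.strip():
                pool.submit(run, json.loads(line))
    # Leaving the with-block waits for all in-flight requests to finish.


# Example: three requests pushed through a single worker "process".
stdin = io.StringIO("".join(
    json.dumps({"request_id": i, "arguments": ["Foo%d.java" % i]}) + "\n"
    for i in range(3)))
stdout = io.StringIO()
serve(stdin, stdout, handle=lambda args: "compiled " + args[0])
replies = {r["request_id"]: r["output"]
           for r in map(json.loads, stdout.getvalue().splitlines())}
```

Because every request runs in the same process, anything `handle` caches (parsed ASTs, JIT-compiled code) is shared across all in-flight requests, which is the point of the proposal.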

Design No. 2: TCP

  • Add upgrade field to response proto that redirects Bazel to a HostAndPort.
  • All future requests get sent there via TCP
  • Don't use gRPC; just send the raw protos
  • Maybe allow multiple requests per socket
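Sending raw messages over one socket still needs some framing so the receiver knows where each message ends. The sketch below shows one common approach, a length prefix in front of every serialized message. The 4-byte big-endian prefix and the `write_frame` / `read_frame` helpers are assumptions for illustration and are not part of the proposal or of Bazel's worker protocol.

```python
import io
import struct


def write_frame(stream, payload: bytes) -> None:
    """Write a 4-byte big-endian length, then the message bytes."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Return the next message, or None if the stream ended cleanly."""
    header = stream.read(4)
    if len(header) < 4:
        return None
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended mid-message")
    return payload


# Example: two messages share one byte stream and are recovered intact.
buf = io.BytesIO()
write_frame(buf, b"request-1")
write_frame(buf, b"request-2")
buf.seek(0)
frames = [read_frame(buf), read_frame(buf), read_frame(buf)]
```

With a real socket, the same helpers would work on a buffered binary file object such as the one returned by `socket.makefile("rwb")`. Matching responses to requests would still need something like the request_id field from Design No. 1.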

CC: @lberki, @meistert, with whom I socialized this idea offline, IIRC

P2 sandboxing feature request

All 32 comments

Don't use gRPC just send the raw protos
Maybe allow multiple requests per socket

Those two seem contradictory. That is, if you want multiple requests per socket, you again need to implement something like gRPC.

Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support, so I'd rather it pays its rent.

Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

Fun fact, a multiplexing TCP-based version of this was implemented in the very first version of persistent workers and worked perfectly fine with a multi-threaded version of the JavaBuilder worker - but was then deemed unnecessary complexity by me and teammates and the code was deleted (AFAIR without even submitting the CL, so we can't restore it from history, ouch) and replaced with the simpler, serialized, multi-process stdin/stdout mechanism. :| Maybe we should have gone with the more complex version in the first place. Hindsight is best sight.

I'll have a look at this! Thanks for writing this proposal down so cleanly.

@philwo My pleasure. Did you consider using the multiplexing technique described in Design No. 1? That would avoid the socket complexity and should hopefully be pretty straightforward. The user could continue doing things the simple way if he wants.

@lberki It's far too easy to accidentally link the wrong thing in our internal repo and end up with so many jars that the JVM takes up gigs of memory. The JVM is amazing at threads and garbage collection so it makes sense to me to utilize those strengths, just like Bazel does.

@abergmeier-dsfishlabs The requests would still get sent to the frontend in parallel. Maybe the four frontends could scheme together to launch a single backend, possibly by locking a single input file, but I'm not sure if Bazel would consider that hermetic.

Why not have a frontend-process, which delegates to one background-process? Why does it have to be implemented on the Bazel level?

Because that would be really easy to mess up. Who would start it? Who would stop it? Who would check to see if files have changed?

Do you have data on the costs of not implementing this? Sure, it can be done, but it's yet another feature we have to support, so I'd rather it pays its rent.

Lucid Software has seen a massive memory regression in transitioning from sbt (Scala) to Bazel. sbt used a single JVM process, so it could JIT the Scala compiler once and have reasonable memory overhead. Then Bazel says, "Hey, if you want the same performance you had before, take your machine apart, cram a bunch more RAM into it, and start 8x the number of processes, each doing the exact same JIT."

As @jart said, it's insane to have multiple local caches. And the only reason to use workers is to cache things, no? (Typically caching JIT, sometimes just caching loading the executable, and I suppose you can get fancier.) Is there any situation that wouldn't use significantly less memory with this proposal?

Is there any situation that wouldn't use significantly less memory with this proposal?

For compilers that don't support multi-threaded compilation, it should be neither a win nor a loss. I remember talking to @philwo offline, and I believe we agreed that it's a good idea to support your use case. Would you be interested in working on this, @pauldraper?

For compilers that don't support multi-threaded compilation

That's true. Say, Node.js-based compilers.

Would you be interested in working on this @pauldraper ?

I'm no longer working with this, but @jjudd may be interested.

My 2c though: I suppose TCP would help the #4897 issues. But I like the simplicity of stdin/stdout (even when multiplexed). And not fiddling with Nagle, etc.

FWIW, the apt transport protocol is multiplexed stdin/stdout with an executable.

We are definitely interested in this. Launching lots of JVMs consumes lots of resources.

I'm not sure when we will have time to work on it, but it is something we are interested in.

@buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

@buchgr do you have a ballpark estimate of how large of an effort you think developing this would be? I lack enough context on the Bazel codebase to effectively tell if it is a 2 day, 2 week, or 2 month effort.

As the original author of the workers feature @philwo should be able to answer this question best and provide guidance.

/sub

@philwo friendly ping. In your opinion how large of a task is this? Hours, days, weeks, months?

I think it shouldn't take long - days for a first prototype, weeks for a first complete version maybe? This would only concern the Bazel side though, I can't speak about updating existing workers to take advantage of the new protocol.

All the worker related code in Bazel is concentrated here: https://source.bazel.build/bazel/+/master:src/main/java/com/google/devtools/build/lib/worker/ - so you don't need much context about how Bazel works.

There's an integration test, too: https://source.bazel.build/bazel/+/master:src/test/shell/integration/bazel_worker_test.sh

Regarding the protocol, I'm open to whatever you'd come up with that works well and is easily integrated into various languages out there. I think I've seen persistent workers written in Java, JavaScript, TypeScript, Dart so far.

@cushon (Java), @mprobst (TypeScript) and @davidmorgan (Dart) might want to comment on this with their ideas / wishes. :)

From the Dart side: parallel requests in one worker aren't super exciting since Dart is single-threaded. We're planning on experimenting with build performance in Q4, and we might have some suggestions for worker protocol changes of our own. (Not super high probability, though; 20% maybe.)

Same on the TypeScript side: our workers are necessarily single-threaded, so this wouldn't help us.

@kevin1e100 might be interested in this for kotlin.

Javac is single-threaded, but I think this would allow us to run multiple instances of it in one worker and share a cache and memory footprint, instead of having multiple workers, which starts to use a lot of memory and means any caching takes a long time to work up. How do Dart and TypeScript avoid those issues with the current approach?

Right, even with single-threaded underlying tools, if they can safely be run in parallel, that can still be a win, I would think. But from my point of view this really shines when the worker wants to do some kind of caching (example below) or incremental scheme (e.g., Java compilation is typically incremental in the Eclipse IDE, IIUC). Bazel's DexBuilder worker for Android builds, for instance, uses caching, but as @cushon mentioned, all worker instances have their own cache, which can be unfortunate.

Thanks for the estimate, @philwo. We are starting work on this. @borkaehw is leading the implementation on our end. We'll keep people updated as we make progress, propose designs, etc.

@jjudd it would be great if you could share a design document with bazel-discuss / bazel-dev before doing the implementation. We are happy to review it and give pointers! Thanks so much!

@cushon users of bazel+workers+Dart are google internal--we just use a lot of RAM.

Cc @johnynek, since we (rules_scala) also use a persistent worker and I think we'd love this feature as well.

@davidmorgan are you using the workers for caching/incrementality, or mostly to keep a VM warm? If you're using caching, have you seen issues with the hit rate from having a separate cache for each worker instance?

@cushon Right now mostly to keep a VM warm. We hope to gain more from caching/incrementality in future.

We have a design doc from Lucid Software; it basically follows Design No. 1: Multiplex. We would like to take feedback and open a PR after we address everyone's concerns.

https://docs.google.com/document/d/1OC0cVj1Y1QYo6n-wlK6mIwwG7xS2BJX7fuvf1r5LtnU/edit?usp=sharing

@borkaehw thank you for the doc! Could you please read the Bazel Design Review process and follow the steps there? This will improve the visibility of the doc to the entire Bazel ecosystem (bazel-dev, proposals repository).

cc @laurentlb

@jin thanks, I would love to.

@jin just want to make sure I understand the steps.

  1. Create a PR in bazelbuild/proposals to add a row under "Under review".
  2. Send a greeting to the bazel-dev mailing list with the design doc and discussion thread (this thread).

I am not sure if we want to make a PR to bazelbuild/bazel since some features need to be decided and not yet implemented.

Am I missing anything?

Yes, please hold off creating a Bazel PR until your design has been
approved. The main discussion thread should be the announcement in the
bazel-dev mailing list, not this issue, but you can link to this issue
there.

@jin just completed the requirements. Let me know if further actions are needed. Thanks.

Link to the mailing list thread for those interested: https://groups.google.com/forum/#!topic/bazel-dev/bLoQsFpd2bA

Just realized that this was never linked here. A prototype implementation that we've been using internally for 2-3 weeks now is in review here: #6857
