The question of why Bazel is written in Java has come up many times, and every time, the team has had to give the same answers.
We should add an entry to the FAQ that clarifies this. In particular:
(I think someone proposed this idea during BazelCon but I can't remember who for credit. Sorry!)
Many people think they need to install Java before they can use Bazel.
e.g. https://twitter.com/Jakeherringbone/status/1053653932893908994
Julio: Since you started this, do you want to take a stab at writing something for the FAQs?
bump... this is still a question that comes up quite often.
I remember the first version of Bazel being written in Go? Maybe I'm mistaken.
Bazel (and Blaze before it) was always written in Java. The predecessor to Blaze was written in Python (and that's why Bazel is using a Python-like language for BUILD files---because they were originally Python scripts). "Go was publicly announced in November 2009, and version 1.0 was released in March 2012." [Wikipedia]
Blaze was begun in ~2007 by @alandonovan - IMO, the first 'real' part of Blaze was the BUILD file parser. He can probably explain better than me why he chose Java at the time, but Go was probably not an option yet.
Google's previous build system was a Python program called gconfig that read BUILD files (which it loaded and executed as Python code) and wrote a Makefile, which was then executed by a thin wrapper around GNU Make. The Makefile would regenerate itself when the BUILD files changed---in theory. In reality, incremental builds often produced different results from clean builds, which was a huge waste of everyone's time.
There was a widespread desire for a more integrated implementation, so various interested people had a meeting some time in early 2006 to discuss what the next system should look like. There was no question that it had to be a typed language, which at that point meant C++ or Java. (Rob Pike was in the room. Sadly Go wasn't invented for four more years, as it would have been the ideal tool for the job. Google uses very little Rust, Scala, and Haskell.) If memory serves, Java won primarily because it had a garbage collector---and half the job of a build tool is concurrent operations on often-cyclic graphs. And I'm sure that at least in part it was because Java was the language Johannes Henkel and I, who started the project, were using at the time.
The other half of the job of a build tool is interacting with the operating system: reading files and directories, writing files, communicating over a network, and controlling other processes. In hindsight, the JVM was a poor choice for this work, and many of Blaze's problems stem from it. Most system calls are inaccessible. Strings use UTF-16, requiring twice as much space and expensive conversions at I/O boundaries. Its objects afford the user little control over memory layout, making it hard to design efficient core data structures, and offer no means of escape for performance-critical code. Also, compiling for the JVM is slow---surprisingly, slower than C++ or Go even though the compiler does less---yet the resulting code is also slow to start and slow to warm up, pushing CPU costs that should be borne by the developer onto the user. (Google runs Blaze's JVM on an 18-bit number of cores.) The JVM is opaque, noisy, and unpredictable, making it hard to get accurate CPU profiles, which are crucial to finding and optimizing the slow parts. The only thing I really like about the JVM is that it can run in a mode in which Object pointers occupy 32 bits but can address 32GB, which is a significant space saving in a pointer-heavy program.
We've talked about doing rewrites in another language, and even done some prototyping in Go, but the obstacle is always that users have little tolerance for breaking changes. A language like C++ or Java is relatively stable and well specified, so it is feasible to switch a project from one toolchain to another. Blaze has far more dark corners than C++ and is constantly changing. That's what motivated me to build go.starlark.net, a reference implementation of Starlark, to shine a light in the dark corners, starting at the bottom. Even if we never get to a point where we can do a complete rewrite, it is already paying dividends for code health.
Does “compiling for the JVM is slow---surprisingly, slower than C++ or Go even though the compiler does less” mean that compiling Java from source to .class files is slower than compiling C/C++ to object files?
Yes, at least with Blaze. In part it's because Blaze Java builds are so larded with user-defined annotation processors and ErrorProne checks. It's also because C++ compilation is immensely parallel: all compilations can be started at once because the header files exist ahead of time, whereas the Java compiler requires the output of compiling each previous dependency, leading to long critical paths. (The Go toolchain does this too, but its .a files summarize the types of transitive dependencies so you only need one .a per import versus one .jar per transitive dependency, which massively reduces the quantity of data in a build.) I should also note that Blaze's own build is slow for numerous reasons unrelated to the design of the JVM.
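The ordering constraint described above is easy to see with a toy example (a minimal sketch with made-up class names, not how Blaze actually drives the compiler): javac cannot compile a file until the compiled output of its dependency exists, whereas a C++ compiler could start every translation unit at once against pre-existing headers.

```java
// Sketch: javac needs its dependency's compiled output before it can
// compile the dependent file, which is what creates long critical paths.
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.nio.file.Files;
import java.nio.file.Path;

public class CriticalPath {
    /** Returns {exit compiling Main without Dep, exit for Dep, exit for Main after Dep}. */
    public static int[] run() throws Exception {
        Path dir = Files.createTempDirectory("critpath");
        Path out = Files.createDirectories(dir.resolve("classes"));
        Path dep = Files.writeString(dir.resolve("Dep.java"),
                "public class Dep { public static int answer() { return 42; } }");
        Path main = Files.writeString(dir.resolve("Main.java"),
                "public class Main { public static void main(String[] a) {"
                + " System.out.println(Dep.answer()); } }");

        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        // Main references Dep, so compiling it before Dep's output exists fails.
        int withoutDep = javac.run(null, null, null,
                "-d", out.toString(), "-cp", out.toString(), main.toString());
        // Compile the dependency first; then the dependent compiles cleanly.
        int depExit = javac.run(null, null, null, "-d", out.toString(), dep.toString());
        int mainExit = javac.run(null, null, null,
                "-d", out.toString(), "-cp", out.toString(), main.toString());
        return new int[] {withoutDep, depExit, mainExit};
    }

    public static void main(String[] args) throws Exception {
        int[] r = run();
        System.out.println("without dep: " + r[0] + ", dep: " + r[1] + ", main: " + r[2]);
    }
}
```

In a real build the same constraint holds across library boundaries: each javac action waits for the jar (or interface jar) of every dependency it compiles against.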
C++ modules may have broken that model again.
...and the Go team is debating moving toward a Java-style approach to reduce the number of redundant copies of type information for one particularly notorious large Google package that nearly everything depends upon. What's old is new again...
the Java compiler requires the output of compiling each previous dependency, leading to long critical paths
The Java compilation model in Bazel shares some elements of the C++ and Go ones.
We have the concept of Java 'header compilation', where a separate tool (turbine) compiles sources to interface jars faster than javac, so the actions on the critical path for a chain of Java libraries are not running javac (except for the terminal one), and more of the javac actions can be parallelized. The characteristics end up being kind of like C++ with modules.
Also, the outputs of header compilation actions include a summary of their transitive dependencies (somewhat like Go), so the classpath of a downstream header compilation action only includes jars from its direct dependencies, not transitive ones.
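Why an interface-only jar is enough to compile against can be demonstrated by hand (a sketch with invented class names, not turbine's actual mechanics): the JVM links classes by name and signature at load time, so code compiled against a body-stripped stub runs unchanged against the real implementation.

```java
// Sketch: compile App against a "header" version of Lib whose method body is
// stripped (as an interface jar would be), then run it against the real Lib.
import javax.tools.ToolProvider;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeaderCompile {
    static int javac(Path out, Path src, Path cp) {
        String[] args = cp == null
                ? new String[] {"-d", out.toString(), src.toString()}
                : new String[] {"-d", out.toString(), "-cp", cp.toString(), src.toString()};
        return ToolProvider.getSystemJavaCompiler().run(null, null, null, args);
    }

    public static String run() throws Exception {
        Path dir = Files.createTempDirectory("hjar");
        Path stubDir = Files.createDirectories(dir.resolve("stub"));
        Path realDir = Files.createDirectories(dir.resolve("real"));
        // The "header": Lib's API with its implementation removed.
        Files.writeString(stubDir.resolve("Lib.java"),
                "public class Lib { public static String greet(String n)"
                + " { throw new AssertionError(); } }");
        Files.writeString(realDir.resolve("Lib.java"),
                "public class Lib { public static String greet(String n)"
                + " { return \"hello, \" + n; } }");
        Files.writeString(dir.resolve("App.java"),
                "public class App { public static String run()"
                + " { return Lib.greet(\"bazel\"); } }");

        Path stubOut = Files.createDirectories(dir.resolve("stub-classes"));
        Path realOut = Files.createDirectories(dir.resolve("real-classes"));
        Path appOut = Files.createDirectories(dir.resolve("app-classes"));
        javac(stubOut, stubDir.resolve("Lib.java"), null);
        javac(realOut, realDir.resolve("Lib.java"), null);
        // App is compiled against the stub only; it never sees the real body.
        javac(appOut, dir.resolve("App.java"), stubOut);

        // At run time, load App together with the *real* Lib.
        try (URLClassLoader cl = new URLClassLoader(
                new URL[] {appOut.toUri().toURL(), realOut.toUri().toURL()})) {
            return (String) Class.forName("App", true, cl).getMethod("run").invoke(null);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());  // hello, bazel
    }
}
```

This is also why editing only a method body in a library doesn't force recompilation of its dependents: the interface jar is unchanged.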
What's old is new again...
In earlier versions of Blaze there was a tool that extracted source headers from Java libraries and then processed the transitive source headers in one action (so the ergonomics were more like C++ with headers), which theoretically ensured the critical path was always of length ~3. We stopped doing that in part because processing the transitive closure of source headers was very expensive (C++ headers have similar issues), and because it didn't work with annotation processing.
I've also looked at summarizing transitive symbols for use with javac, but it's easy to end up with more classpath data overall: there are fewer jars, but each one repackages a significant fraction of its transitive closure. Instead, there are heuristics to omit jars from transitive dependencies we don't expect to need, to keep the number of inputs more manageable.
Could javac itself compile the Java files in parallel, e.g. `javac -d . a.java b.java c.java d.java` with all the parallelism happening inside javac? I can't find any implementation of this.
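As far as I know, no: a single javac invocation does accept a whole batch of files and resolves them together, but the compiler pipeline itself is essentially single-threaded (the experimental OpenJDK sjavac project explored parallel compilation). Bazel's parallelism comes from running many independent javac/turbine actions concurrently. A minimal sketch (hypothetical file names) of driving one batch invocation programmatically:

```java
// Sketch: one javac invocation compiling a batch of generated files.
// This amortizes JVM startup across files, but the compilation itself
// runs on one thread; build-level parallelism comes from running many
// independent compiler actions at once.
import javax.tools.ToolProvider;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BatchCompile {
    /** Generates n trivial sources and compiles them in a single javac run. */
    public static int compileAll(int n) throws Exception {
        Path dir = Files.createTempDirectory("batch");
        List<String> args = new ArrayList<>();
        args.add("-d");
        args.add(dir.resolve("classes").toString());
        for (int i = 0; i < n; i++) {
            Path src = dir.resolve("C" + i + ".java");
            Files.writeString(src, "public class C" + i + " { int x = " + i + "; }");
            args.add(src.toString());
        }
        return ToolProvider.getSystemJavaCompiler()
                .run(null, null, null, args.toArray(new String[0]));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("exit: " + compileAll(4));
    }
}
```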