Bazel: Persistent test execution worker (TestRunner supporting worker strategy)

Created on 1 Mar 2019 · 14 Comments · Source: bazelbuild/bazel

Description of the problem / feature request:

In our bazel conversion experiment we have moved a small subset of apps, libraries, and tests over to bazel. Running head-to-head in CI, we see nearly 2x the wall time on bazel. Some of that we can optimize away, but 73 of 75 tests run in under 100ms on Buck, whereas bazel runs them in 1-3 seconds each. This leads to 3 minutes for Buck, 3 minutes for Gradle, and 6 minutes for Bazel executing the same build set.

While we have not fully tweaked all the optimization settings (some parts of the build we need require workers, and sandboxing is expensive), our investigation accounted for that: the biggest cost is test execution, which takes around 10x as long per test on Bazel as on Gradle or Buck in our situation.

A few relevant pieces of the puzzle include:

  • When we create an "AllTest" suite (using @RunWith(Suite.class) and @Suite.SuiteClasses({...})),
    the suite runs all the test code (confirmed by forcing failures throughout the tests), and it
    runs in vastly less time than the sum of all the individual executions. In an example run,
    AllTest executed 16 tests' worth of code in 2.7s, while the 16 individually each took between
    0.8 and 1.9 seconds. This "AllTest" suite incurs the test-runner setup penalty only once for
    N tests, and that made a huge difference. (A minimal example of such a suite appears after
    this list.)
  • These tests were parallelized, and the 2.7 seconds above came from the small build; in a
    clean build these times were closer to 0.9-3.2s, and when all 75 tests were run, the tests
    got slower individually by at least 0.4 seconds, and the whole build lengthened
    disproportionately. This is somewhat limited by the number of worker processes, but neither
    Buck nor Gradle hit this wall, and their test times remained sub-second for nearly all of
    the tests.
  • In the trace profiles, test execution is not further broken down, so there's little ability to
    determine how much time is being taken by the test execution itself, except by inference as
    above.
  • When building an APK, bazel's times were quite competitive: faster than Gradle and
    comparable to Buck. The slowdown appears only in the test scenario.
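
For concreteness, here is a minimal self-contained JUnit 4 suite of the kind described in the first bullet; the nested classes are trivial placeholders for real test classes:

```java
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// Two trivial tests aggregated under one suite target, so the test-runner
// setup cost is paid once rather than once per test class.
@RunWith(Suite.class)
@Suite.SuiteClasses({AllTest.FirstTest.class, AllTest.SecondTest.class})
public class AllTest {
  public static class FirstTest {
    @Test public void passes() {}
  }

  public static class SecondTest {
    @Test public void alsoPasses() {}
  }
}
```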

Feature requests: what underlying problem are you trying to solve with this feature?

Implement persistent worker support for the TestRunner, to avoid the setup/teardown costs associated with invoking a new TestRunner.
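
To sketch what this could look like: Bazel's persistent workers speak a simple protocol of length-delimited WorkRequest/WorkResponse protobufs over stdin/stdout, started with a --persistent_worker flag. A hypothetical test-runner worker built on that protocol might look roughly like the following; the class name and the way a test class is picked from the request arguments are illustrative assumptions, not an actual Bazel API:

```java
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkRequest;
import com.google.devtools.build.lib.worker.WorkerProtocol.WorkResponse;

// Hypothetical sketch only: a test runner that stays resident and serves
// Bazel work requests, paying JVM/runner startup once per worker process.
public final class PersistentTestRunnerWorker {
  public static void main(String[] args) throws Exception {
    if (!java.util.Arrays.asList(args).contains("--persistent_worker")) {
      System.exit(runOnce(args)); // single-shot fallback
    }
    while (true) {
      // Blocks until Bazel sends the next request; null means stdin closed.
      WorkRequest request = WorkRequest.parseDelimitedFrom(System.in);
      if (request == null) {
        return;
      }
      int exitCode = runOnce(request.getArgumentsList().toArray(new String[0]));
      WorkResponse.newBuilder()
          .setExitCode(exitCode)
          .setRequestId(request.getRequestId())
          .build()
          .writeDelimitedTo(System.out);
      System.out.flush();
    }
  }

  // Illustrative dispatch: treat the first argument as a test class name.
  private static int runOnce(String[] args) throws Exception {
    Class<?> testClass = Class.forName(args[0]);
    org.junit.runner.Result result = org.junit.runner.JUnitCore.runClasses(testClass);
    return result.wasSuccessful() ? 0 : 1;
  }
}
```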

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Make a project, make a bunch of tests. I don't have a repro setup at present, but will be working one up.

What operating system are you running Bazel on?

CentOS and MacOS (different numbers, same proportions)

What's the output of bazel info release?

INFO: Invocation ID: 329bc936-43a5-48ba-b3b3-17ea3f158122
release 0.22.0

Have you found anything relevant by searching the web?

In searching, we found discussions of an experimental persistent worker supporting the TestRunner, but the code seems to have been deleted, and none of the instructions work anymore.

Any other information, logs, or outputs that you want to share?

Example (redacted) test run with 16 tests.

internal:AllTest  PASSED in 2.7s
internal:01Test   PASSED in 0.8s
internal:02Test   PASSED in 1.4s
internal:03Test   PASSED in 1.9s
internal:04Test   PASSED in 1.4s
internal:05Test   PASSED in 1.3s
internal:06Test   PASSED in 1.0s
internal:07Test   PASSED in 1.7s
internal:08Test   PASSED in 0.9s
internal:09Test   PASSED in 1.9s
internal:10Test   PASSED in 1.7s
internal:11Test   PASSED in 1.5s
internal:12Test   PASSED in 1.0s
internal:13Test   PASSED in 1.4s
internal:14Test   PASSED in 1.9s
internal:15Test   PASSED in 0.8s
internal:16Test   PASSED in 1.6s

Buck equivalent (didn't run the AllTest in this codebase)

[2019-02-26 02:29:04] PASS    <100ms  3 Passed   0 Skipped   0 Failed   01Test
[2019-02-26 02:29:04] PASS    <100ms  8 Passed   0 Skipped   0 Failed   02Test
[2019-02-26 02:29:04] PASS    <100ms  7 Passed   0 Skipped   0 Failed   03Test
[2019-02-26 02:29:04] PASS    <100ms 12 Passed   0 Skipped   0 Failed   04Test
[2019-02-26 02:29:04] PASS    <100ms 16 Passed   0 Skipped   0 Failed   05Test
[2019-02-26 02:29:04] PASS    <100ms 10 Passed   0 Skipped   0 Failed   06Test
[2019-02-26 02:29:04] PASS     109ms  9 Passed   0 Skipped   0 Failed   07Test
[2019-02-26 02:29:04] PASS    <100ms 14 Passed   0 Skipped   0 Failed   08Test
[2019-02-26 02:29:04] PASS    <100ms 20 Passed   0 Skipped   0 Failed   09Test
[2019-02-26 02:29:04] PASS    <100ms 10 Passed   0 Skipped   0 Failed   10Test
[2019-02-26 02:29:04] PASS    <100ms  2 Passed   0 Skipped   0 Failed   11Test
[2019-02-26 02:29:04] PASS    <100ms  9 Passed   0 Skipped   0 Failed   12Test
[2019-02-26 02:29:04] PASS     120ms  9 Passed   0 Skipped   0 Failed   13Test
[2019-02-26 02:29:04] PASS    <100ms 18 Passed   0 Skipped   0 Failed   14Test
[2019-02-26 02:29:04] PASS    <100ms  2 Passed   0 Skipped   0 Failed   15Test
[2019-02-26 02:29:04] PASS     148ms 30 Passed   0 Skipped   0 Failed   16Test
Labels: P3, team-Local-Exec, feature request

All 14 comments

Interestingly, there was an implementation of this but then it was deleted (0c9f2d4c15b761e3f3b863658b6d5c65bde6db22).

/cc @meisterT

Worth noting, one mitigation is auto-generating per-package or per-top-level-package "AllTest" classes that contain the @Suite plumbing, replacing the individual java_test targets. That definitely reduces test execution times, but at the cost of smooshing together all the dependencies and removing any ability to do "affected test" filtering.

@meisterT Hmm, I was about to move this to team-Performance but you have just done the opposite. I think I heard from you that test setup is one of the major penalties we have now, performance-wise? Also, while this sounds like a "local execution request", is it really? I mean, is there something to change in the worker code in Bazel or what we really have to do is modify the workers themselves to support this request? Or maybe I don't understand the goal of team-Performance properly...

Pinging this again. Now that we have thousands of tests in our corpus, a per-package aggregate generated test suite (running all junit test cases in a single test target per package) has shaved about 1/4 to 1/2 off our build time, just by itself. The lack of any way to avoid the extra tax of non-persistent runner invocation is a pretty big deal when you don't have a build farm.

To give it some meat, an example run on a commit from yesterday (doing full builds, not "affected test" query magic) did this:

| | aggregate | individual |
|---|---|---|
| test targets | 396 | 1826 |
| no cache | 01:02:32 | 02:00:12 (timeout) |
| cache | 49:03 | 01:49:04 |

Now, times vary a lot, and we're working to reduce the deviation, but this is representative of build times, on these machines.

Pinging @lberki, author of https://github.com/bazelbuild/bazel/commit/0c9f2d4c15b761e3f3b863658b6d5c65bde6db22 - do you have more context / background on why the experiment didn't work out?

Huh, that was a while ago... more context is on (internal) b/25006099, but the short version is that it proved to be difficult to be both correct enough and fast enough. The following issues come to mind:

  1. There were classes that are common dependencies of the test runner and the code under test (think Guava), so separating them was difficult
  2. It proved to be hard to "purge" the state of the JVM from previous test runs (although we could probably have tried harder with classloading magic)
  3. A lot of time is spent in JIT compiling the code under test, which you can't avoid on each test run if you want to be correct

If I had to try again, I would try jury-rigging something with either native-image or CRIU so that the JVM startup and the test runner are AOT-compiled, then dynamically load the actual code under test (handwave-handwave). That way, we'd get easily provable correctness without incurring (most of) the overhead.

That wouldn't help with JIT compiling potentially large amounts of code under test, though.
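
To make the handwaving slightly more concrete: the "dynamically load the actual code under test" part could in principle be approximated with a throwaway classloader per test invocation, with shared dependencies (the Guava problem from point 1) resolving from the parent loader. A hedged sketch under those assumptions, not an actual design:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: keep the JVM and runner warm, but load each invocation's code
// under test in a fresh, discardable classloader so static state from the
// previous run is unreachable afterwards. Inputs are hypothetical.
public final class IsolatedRun {
  public static int runIsolated(URL[] testJars, String testClassName)
      throws Exception {
    // Parent loader holds the runner and shared deps (JUnit, Guava, ...);
    // only the code under test lives in the child loader.
    try (URLClassLoader loader =
        new URLClassLoader(testJars, IsolatedRun.class.getClassLoader())) {
      Class<?> testClass = Class.forName(testClassName, true, loader);
      org.junit.runner.Result result =
          org.junit.runner.JUnitCore.runClasses(testClass);
      return result.wasSuccessful() ? 0 : 1;
    }
    // Once the loader becomes unreachable, its classes can be unloaded,
    // approximating a "purged" JVM for the next request.
  }
}
```

The hard part, as point 2 above notes, is that anything reachable from parent-loaded classes can still leak state across runs, so this buys isolation only for the child-loaded code.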

From 02f8da948bf3955304a4cef9399bd3907430bbc4, it seems this idea has made a comeback?

Indeed, Irina is giving it another try.

What's the status of this?

There's a working version in blaze, but not in bazel.

@iirina any chance this could be ported to bazel as well?

Didn't this just get released as an experimental flag?
