Tooling for coverage fails when running under remote execution (LcovMerger specifically)
Run a test target with bazel coverage --spawn_strategy=remote --remote_executor=... //path/to:test
Ubuntu 14.04
bazel info release
release 0.10.1
Relevant portion of coverage execution
+ exec bazel-out/k8-fastbuild/bin/external/bazel_tools/tools/test/LcovMerger --coverage_dir=%execroot%/_coverage/path/to/test/test --output_file=%execroot%/bazel-out/k8-fastbuild/testlogs/path/to/test/coverage.dat
%execroot%/bazel-out/k8-fastbuild/bin/external/bazel_tools/tools/test/LcovMerger: Cannot locate runfiles directory. (Set $JAVA_RUNFILES to inhibit searching.)
Is this bug still an issue? cc @lberki @iirina
I see the same issue when running some of our C++/Python tests with remote execution. Bazel version is a slightly patched version of 0.25.3.
I've come up with a workaround that basically involves the below patch. I have a feeling there is a more correct way to do this--possibly by including some implicit dependencies in the relevant coverage-aware Bazel rules, but I am not familiar enough with the codebase to efficiently tackle that.
commit ba16394fa207709b2b2a2fdc22de30cc72cf4398 (HEAD -> rbe-coverage)
Author: Victor Robertson <[email protected]>
Date: Wed Sep 11 23:01:49 2019 -0700
Fix coverage for RBE [EP-12079]
This patch fixes code coverage in RBE by removing the distinction
between JAVA_RUNFILES and TEST_SRCDIR in the collect_coverage.sh script.
This is important as Bazel will not send JAVA_RUNFILES to the remote
environment. Unfortunately, this means that a Java sdk will be included
with every coverage test--perhaps this isn't so bad.
This also means that tests with coverage support must include the
dependencies normally provided by Bazel. In cruise/cruise, we use the
following additional dependencies:
- @embedded_jdk//:jdk
- @bazel_tools//tools/jdk:JacocoCoverageRunner
- @bazel_tools//tools/test/CoverageOutputGenerator/java/com/google/devtools/coverageoutputgenerator:Main
diff --git a/tools/test/collect_coverage.sh b/tools/test/collect_coverage.sh
index a9e212ce5d..df3b22cc78 100755
--- a/tools/test/collect_coverage.sh
+++ b/tools/test/collect_coverage.sh
@@ -188,7 +188,8 @@ if [[ $DISPLAY_LCOV_CMD ]] ; then
echo "-----------------"
fi
-# JAVA_RUNFILES is set to the runfiles of the test, which does not necessarily
-# contain a JVM (it does only if the test has a Java binary somewhere). So let
-# the LCOV merger discover where its own runfiles tree is.
-JAVA_RUNFILES= exec $LCOV_MERGER_CMD
+# cruise: Since JAVA_RUNFILES would never be pushed to RBE, we allow it to
+# remain as it is and ensure that TEST_SRCDIR has the right dependencies to run
+# coverage reports. Normally this would be preceded with JAVA_RUNFILES= to
+# clear it out.
+exec $LCOV_MERGER_CMD
As mentioned in the patch, this means our test configurations include this little kludge. You can imagine how this is used (data = add_coverage_test_runtime_data()) along with the associated config.
def add_coverage_test_runtime_data():
    return select({
        "@//tools:is_coverage_build": [
            "@embedded_jdk//:jdk",
            "@bazel_tools//tools/jdk:JacocoCoverageRunner",
            "@bazel_tools//tools/test/CoverageOutputGenerator/java/com/google/devtools/coverageoutputgenerator:Main",
        ],
        "//conditions:default": [],
    })
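To make the usage concrete, a hypothetical BUILD fragment might look like the following; the load path, target name, and sources are made up for illustration, not taken from the reporter's repository:

```python
# Assumed location of the macro above; adjust to wherever it lives.
load("//tools:coverage.bzl", "add_coverage_test_runtime_data")

cc_test(
    name = "my_test",
    srcs = ["my_test.cc"],
    # Adds the JDK and coverage tools only when //tools:is_coverage_build
    # is set; expands to an empty list otherwise.
    data = add_coverage_test_runtime_data(),
)
```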
Hopefully this information helps.
The test action only stages the files-to-build, but not the runfiles tree. The coverage tool used by Google is a single, self-contained executable file. This is at odds with the new coverage tool, which is built as a Java binary, and the wrapper script for java_binary expects a runfiles tree. This happens to work locally because the wrapper script manages to escape the symlink sandbox by following a symlink back to the exec root (it wouldn't work with stricter local sandboxing).
I'm still not sure how best to fix it, but I can improve the situation. I'm reluctant to change how the test action stages the coverage tool. However, I have two changes to make the coverage tools repository self-contained (this is a nice cleanup in any case and makes it more flexible in the future), and I have a prototype for making the coverage tools repository compatible with the test action setup by making it work without a runfiles tree. That works on Linux and MacOS because it currently assumes the presence of /bin/bash.
I'm still not sure how best to fix it, but I can improve the situation. I'm reluctant to change how the test action stages the coverage tool.
Why the reluctance? Wouldn't adding the coverage binary properly as a tool to the test action (i.e., fix #4033) be the cleanest resolution?
Definitely maybe. Merging the runfiles tree into the test's runfiles tree can potentially cause conflicts, and we also don't want tests to interact with it directly - both the tool and the contract are subject to change. We could stage the runfiles tree separately from the test runfiles tree, although I suspect that that isn't possible in Bazel right now. At the same time, the tool is a single self-contained deploy jar - it's only the java_binary wrapper script that needs a runfiles tree, not the tool itself.
The JVM could be in runfiles, so I don't think they're trivial to dispense with for the coverage merger.
I suppose you could split the coverage merging into a separate spawn like test.xml generation, but the test itself would need to glob all the intermediate coverage artifacts into a zip or something, which is pointless overhead.
Actually, I discussed splitting it into a separate spawn earlier today, and I think that would have some advantages. We are already close to the per-action execution time limit in some cases, and doing coverage work in the same action is problematic. We can use tree artifacts to track the intermediate artifacts to avoid zipping / unzipping.
You're right about the JVM.
I'm wondering if moving the lcov merger to a post-process would allow us to use an aspect to attach it to the test. We currently require that all test rules depend on the lcov_merger in order to get the postprocessing into the test action, and that requires extra work from rule authors and is also not documented anywhere.
That will probably fix #6293, too.
Fully integrating Go coverage requires some path to generate lcov data. However, I agree that moving to a post-process (aspect or no) would make it more flexible; it allows pulling back the coverage data even if the coverage tools for language X don't support lcov conversion yet (or ever). There's a similar problem with Python coverage which I looked into today - coverage.py doesn't support generating lcov, so right now it is impossible to integrate into Bazel's coverage system, and you can't even manually post-process because the files are silently dropped.
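For the manual post-processing case, translating line data into LCOV records is mechanical. The sketch below is illustrative only: `to_lcov` is a made-up helper, and the input shape (a dict of source path to hit line numbers) is an assumption, not coverage.py's API. It only counts hit lines, so LF equals LH here.

```python
# Hypothetical converter from per-file hit-line data to LCOV text.
# Not a coverage.py or Bazel API; names and input shape are assumed.
def to_lcov(executed_lines):
    """executed_lines: dict mapping source path -> set of hit line numbers."""
    out = []
    for path, lines in sorted(executed_lines.items()):
        out.append("SF:%s" % path)             # source file record
        for line in sorted(lines):
            out.append("DA:%d,1" % line)       # line number, execution count
        out.append("LH:%d" % len(lines))       # lines hit
        out.append("LF:%d" % len(lines))       # lines found (hit lines only)
        out.append("end_of_record")
    return "\n".join(out) + "\n"
```

Writing the result next to the fetched raw outputs would give the rest of the lcov toolchain something it can consume.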
I'll look into declaring a tree artifact for the coverage output dir, if I find the time.
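As a rough Starlark sketch of that idea (not the actual Bazel change; the rule name, attribute, and flags are hypothetical), the coverage output directory would be declared as a tree artifact and handed to a separate merge action:

```python
# Illustrative only. A tree artifact is a directory output whose contents
# are not known until the producing action runs, which fits the per-test
# intermediate coverage files.
def _coverage_postprocess_impl(ctx):
    coverage_dir = ctx.actions.declare_directory(ctx.label.name + ".coverage")
    merged = ctx.actions.declare_file(ctx.label.name + ".coverage.dat")

    # In the real design the test action would populate coverage_dir;
    # only the separate merge action is shown here.
    ctx.actions.run(
        outputs = [merged],
        inputs = [coverage_dir],
        executable = ctx.executable._lcov_merger,  # assumed attribute
        arguments = [
            "--coverage_dir=" + coverage_dir.path,
            "--output_file=" + merged.path,
        ],
    )
    return [DefaultInfo(files = depset([merged]))]
```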
I was just hoping to not have to pass --test_env=LCOV_MERGER=/bin/true.
It looks like it works, and it makes it easier to debug. Unfortunately, the shell wrappers put a bunch of unrelated stuff into the _coverage directory. :-/
As a workaround, you can do the following (currently requires Bazel built from HEAD):
bazel build //tools/test/CoverageOutputGenerator/java/com/google/devtools/coverageoutputgenerator:coverage_output_generator.zip
unzip path/to/clone/bazel-bin/tools/test/CoverageOutputGenerator/java/com/google/devtools/coverageoutputgenerator/coverage_output_generator.zip
bazel coverage //java/foo:foo_test --config=remote --override_repository=remote_coverage_tools=/path/to/unzipped/coverage/output/generator/
Alternatively, you can pull back the full coverage outputs and manually post-process them:
bazel coverage //java/foo:foo_test --config=remote --experimental_fetch_all_coverage_outputs
The downside of this second approach is that it also pulls back a bunch of (potentially) large, unnecessary files.
It seems like it should be possible to move the coverage merging to a post-process using the coverage tree artifact as an intermediary.
Ulf, do you plan to do anything here?
I tried the --experimental_fetch_all_coverage_outputs flag and I just get:
<snip>/_coverage/gcov is a symbolic link to an absolute path /usr/bin/gcov. Symlinks to absolute paths in action outputs are not supported.
@finn-ball Bazel puts a bunch of files into what should be the coverage output directory, and one of them is a symlink. However, I am under the impression that this error message comes from the remote cache, which would imply that you're using the flag for local execution. Do you see problems with coverage in local execution?
(When I wrote the flag, I was planning to gradually clean up the coverage runner to avoid putting these files in the coverage directory.)
With the --experimental_fetch_all_coverage_outputs flag, remote execution works together with --test_env=LCOV_MERGER=/bin/true.
However, I am under the impression that this error message comes from the remote cache, which would imply that you're using the flag for local execution. Do you see problems with coverage in local execution?
@ulfjack - By using remote-execution, am I not also using the remote cache anyway?
Ah, I see where that comes from - the remote executor checks the symlinks it gets from the remote system before it re-creates them locally, and it causes that error message.
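The check described above can be illustrated with a small sketch (illustrative only, not Bazel's actual remote-execution code): walk the action's output directory and flag any symlink whose target is an absolute path, which is exactly what trips over _coverage/gcov pointing at /usr/bin/gcov.

```python
import os

# Sketch of the validation: action outputs may contain relative symlinks,
# but symlinks to absolute paths cannot be re-created portably, so they
# are rejected.
def find_absolute_symlinks(output_dir):
    """Return paths under output_dir that are symlinks to absolute targets."""
    offenders = []
    for root, dirs, files in os.walk(output_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            if os.path.islink(path) and os.path.isabs(os.readlink(path)):
                offenders.append(path)
    return offenders
```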
Indeed. If I use --test_env=LCOV_MERGER=/bin/true, then my coverage.dat files are returned but they are empty. Are there any other ways around this?
This has been opened for more than 2 years now. Can't this be escalated a bit?
I'm working on this. Hopefully, I have a fix soon.
Actually, I discussed splitting it into a separate spawn earlier today, and I think that would have some advantages. We are already close to the per-action execution time limit in some cases, and doing coverage work in the same action is problematic. We can use tree artifacts to track the intermediate artifacts to avoid zipping / unzipping.
You're right about the JVM.
Hey Ulf, it looks like I have something that is working after splitting coverage post-processing into its own spawn. But now I was wondering why your initial thought was to split it into a different spawn in the same action and not a new separate action altogether.
Good point. I think I simply didn't consider it. Putting it in a separate action definitely has advantages.
@oquenchil - excellent! Have you merged your fix? I just tried this again with bazel v3.7.2 and I still get the same errors as before.
You have to use the flag --experimental_split_coverage_postprocessing. See this test as an example.