Drake: ci: kcov runs are super slow (~12hrs)

Created on 7 Feb 2019 · 15 comments · Source: RobotLocomotion/drake

At present, kcov code-coverage builds are slow (~12 hrs).

~May need tweaking to pipeline to take full advantage of parallelization.~ EDIT(2020-11-03) Don't even know what I meant by this.

(From Kitware notes doc.)

Labels: build system, bazel, low, kitware


We're starting to trip the 12h time limit on CI. We'll have to either bump the limit, or make it faster.

I wonder if we could write kcov output to ${TEST_UNDECLARED_OUTPUTS_DIR}/kcov, and then kcov --merge it after the fact. That would let us run tests in parallel (--jobs=12) in CI, instead of one-at-a-time like we do now:

https://github.com/RobotLocomotion/drake/blob/96e0cc0b94926f0725ba69f544fda6e9200b216c/tools/dynamic_analysis/bazel.rc#L10

@rpoyner-tri Do you have any interest in taking on a fix?

I'll have a look.

What happens to the coverage reports now? Are they accessible anywhere, or do they just go down the memory hole when the CI workspace is wiped?

From local experiments, the merge strategy should work fine. I have some questions still about CI integration and advice for local/developer use.

Given this patch to drake:

diff --git a/tools/dynamic_analysis/bazel.rc b/tools/dynamic_analysis/bazel.rc
index a10b0fbf0..0aab60c6c 100644
--- a/tools/dynamic_analysis/bazel.rc
+++ b/tools/dynamic_analysis/bazel.rc
@@ -7,7 +7,7 @@ build:kcov --copt -g
 build:kcov --copt -O0
 build:kcov --spawn_strategy=standalone
 build:kcov --run_under //tools/dynamic_analysis:kcov
-build:kcov --local_test_jobs=1
+build:kcov --local_test_jobs=HOST_CPUS-1
 build:kcov --test_tag_filters=-lint,-gurobi,-mosek,-snopt,-no_kcov
 build:kcov --nocache_test_results
 # These increased timeouts were set through experimentation. Because kcov runs
diff --git a/tools/dynamic_analysis/kcov.sh b/tools/dynamic_analysis/kcov.sh
index a35e2cea0..2e9e59962 100755
--- a/tools/dynamic_analysis/kcov.sh
+++ b/tools/dynamic_analysis/kcov.sh
@@ -16,6 +16,6 @@ kcov \
     "--include-path=${WORKSPACE}" \
     --verify \
     --exclude-pattern=third_party \
-    "${WORKSPACE}/bazel-kcov" \
+    "${TEST_UNDECLARED_OUTPUTS_DIR}/kcov" \
     "--replace-src-path=/proc/self/cwd:${WORKSPACE}" \
     "$@"

All that would remain is to document the merge step for developer use and integrate it into CI. The merge step looks like this:

$ kcov --merge [OUTPUT-DIR] $(find $(bazel info bazel-testlogs) -type d -name kcov)

kcov will create the output directory if it doesn't exist. The developer steps could rely on a small shell script to simplify the workflow. I assume the CI integration would require CMake surgery in drake-ci; requesting @jamiesnape's advice for that.

The last detail is to remove any references to a bazel-kcov directory, as it would no longer be a magic location.

For drake-ci changes, I think once we have a drake PR reviewed with all of the other changes (to settings, developer docs, etc.) we can ping Jamie and provide the recipe for the new execute_process(COMMAND ...) that we want inserted into drake-ci. We can even open a drake-ci PR with that change, but I don't think there's an easy way to test when combined with the drake PR.

Aside: I think we could remove the local_test_jobs line entirely, instead of altering it.

Corner cases:

  • I haven't gamed out what bazel does with undeclared-outputs data from prior build cycles; do we risk merging stale data?
  • I haven't gamed out what kcov does with coverage data that may already be in the destination directory. I assume it merges when it finds compatibility, and overwrites otherwise. I haven't verified that assumption.

Running a larger-scale test on my Puget workstation, I'm seeing widespread test timeouts; this needs more investigation.

Yes, the test-timeout effect is real. The likely cause is that each kcov-instrumented test runs as two communicating processes: a kcov parent and the test program as its child.

Will recommend something like --local_test_jobs=HOST_CPUS*0.5 (which is a totally legal input -- nice work, bazel).
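For reference, that setting would replace the existing jobs line in tools/dynamic_analysis/bazel.rc roughly as follows (the value is illustrative):

```shell
# Illustrative bazel.rc fragment: cap local test jobs at half the host CPUs,
# leaving headroom for the kcov parent process that each test spawns.
build:kcov --local_test_jobs=HOST_CPUS*0.5
```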

Nope, even half the cpus on a puget still leads to a rash of timeouts. Will look at this all more closely.

The empirical numbers here presumably need adjusting given the newly-introduced overhead from multi-process scheduling:
https://github.com/RobotLocomotion/drake/blob/4cdc13180de5fb323ef97310704950648d4bfa05/tools/dynamic_analysis/bazel.rc#L13-L17

Agreed. Fiddling with that now.

Checklist for integration:

  • [x] merge the drake-ci changes: https://github.com/RobotLocomotion/drake-ci/pull/116. Should not yet affect execution.
  • [x] merge the drake-ci changes: https://github.com/RobotLocomotion/drake-ci/pull/117. Ugh, fixes to 116 above.
  • [x] merge the drake changes: #14294. Should enable the new execution schedule.
  • [x] tighten up kcov_tool ci_merge error checking to fail if it finds no data. This was deliberately relaxed to smooth the multi-repo transition.
  • [x] remove the presence test for kcov_tool from the drake-ci code. That should finish the removal of integration shims.

With #14294 merged now, all six flavors of nightly CI coverage builds ran for 1-3 hours each last night, and passed, and produced sensible results.

Tick box items are done. Closing.
