Drake: ci: kcov runs are super slow (~12hrs)

Created on 7 Feb 2019 · 15 comments · Source: RobotLocomotion/drake

At present, kcov code-coverage builds are slow (~12 hrs).

~May need tweaking to pipeline to take full advantage of parallelization.~ EDIT(2020-11-03) Don't even know what I meant by this.

(From Kitware notes doc.)

Labels: build system, bazel, low, kitware


We're starting to trip the 12h time limit on CI. We'll have to either bump the limit, or make it faster.

I wonder if we could write kcov output to ${TEST_UNDECLARED_OUTPUTS_DIR}/kcov, and then kcov --merge it after the fact. That would let us run tests in parallel (--jobs=12) in CI, instead of one-at-a-time like we do now:

https://github.com/RobotLocomotion/drake/blob/96e0cc0b94926f0725ba69f544fda6e9200b216c/tools/dynamic_analysis/bazel.rc#L10

@rpoyner-tri Do you have any interest in taking on a fix?

I'll have a look.

What happens to the coverage reports now? Are they accessible anywhere, or do they just go down the memory hole when the CI workspace is wiped?

From local experiments, the merge strategy should work fine. I have some questions still about CI integration and advice for local/developer use.

Given this patch to drake:

diff --git a/tools/dynamic_analysis/bazel.rc b/tools/dynamic_analysis/bazel.rc
index a10b0fbf0..0aab60c6c 100644
--- a/tools/dynamic_analysis/bazel.rc
+++ b/tools/dynamic_analysis/bazel.rc
@@ -7,7 +7,7 @@ build:kcov --copt -g
 build:kcov --copt -O0
 build:kcov --spawn_strategy=standalone
 build:kcov --run_under //tools/dynamic_analysis:kcov
-build:kcov --local_test_jobs=1
+build:kcov --local_test_jobs=HOST_CPUS-1
 build:kcov --test_tag_filters=-lint,-gurobi,-mosek,-snopt,-no_kcov
 build:kcov --nocache_test_results
 # These increased timeouts were set through experimentation. Because kcov runs
diff --git a/tools/dynamic_analysis/kcov.sh b/tools/dynamic_analysis/kcov.sh
index a35e2cea0..2e9e59962 100755
--- a/tools/dynamic_analysis/kcov.sh
+++ b/tools/dynamic_analysis/kcov.sh
@@ -16,6 +16,6 @@ kcov \
     "--include-path=${WORKSPACE}" \
     --verify \
     --exclude-pattern=third_party \
-    "${WORKSPACE}/bazel-kcov" \
+    "${TEST_UNDECLARED_OUTPUTS_DIR}/kcov" \
     "--replace-src-path=/proc/self/cwd:${WORKSPACE}" \
     "$@"

All that would remain is to document the merge step for developer use and integrate it into CI. The merge step looks like this:

$ kcov --merge [OUTPUT-DIR] $(find $(bazel info bazel-testlogs) -type d -name kcov)

kcov will create the output directory if it doesn't exist. The developer steps could rely on a small shell script to simplify the workflow. I assume the CI integration would require CMake surgery in drake-ci; requesting @jamiesnape's advice for that.

The last detail is to remove any references to a bazel-kcov directory, as it would no longer be a magic location.

For drake-ci changes, I think once we have a drake PR reviewed with all of the other changes (to settings, developer docs, etc.) we can ping Jamie and provide the recipe for the new execute_process(COMMAND ...) that we want inserted into drake-ci. We can even open a drake-ci PR with that change, but I don't think there's an easy way to test when combined with the drake PR.

Aside: I think we could remove the local_test_jobs line entirely, instead of altering it.

Corner cases:

  • I haven't gamed out what bazel does with undeclared-outputs data from prior build cycles; do we risk merging stale data?
  • I haven't gamed out what kcov does with coverage data that may already be in the destination directory. I assume it merges when it finds compatibility, and overwrites otherwise. I haven't verified that assumption.

Running a larger-scale test on my Puget workstation, I'm seeing widespread test timeouts; this needs more investigation.

Yes, the test-timeout effect is real. The likely cause is that each kcov-instrumented test runs as two communicating processes: a kcov parent and the test program as its child.

Will recommend something like --local_test_jobs=HOST_CPUS*0.5 (which is a totally legal input -- nice work, bazel).
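For reference, that setting would replace the existing jobs line in tools/dynamic_analysis/bazel.rc roughly as follows (the value is illustrative):

```shell
# Illustrative bazel.rc fragment: cap local test jobs at half the host CPUs,
# leaving headroom for the kcov parent process that each test spawns.
build:kcov --local_test_jobs=HOST_CPUS*0.5
```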

Nope, even half the cpus on a puget still leads to a rash of timeouts. Will look at this all more closely.

The empirical numbers here presumably need adjusting given the newly-introduced overhead from multi-process scheduling:
https://github.com/RobotLocomotion/drake/blob/4cdc13180de5fb323ef97310704950648d4bfa05/tools/dynamic_analysis/bazel.rc#L13-L17

Agreed. Fiddling with that now.

Checklist for integration:

  • [x] merge the drake-ci changes: https://github.com/RobotLocomotion/drake-ci/pull/116. Should not yet affect execution.
  • [x] merge the drake-ci changes: https://github.com/RobotLocomotion/drake-ci/pull/117. Ugh, fixes to 116 above.
  • [x] merge the drake changes: #14294. Should enable the new execution schedule.
  • [x] tighten up kcov_tool ci_merge error checking to fail if it finds no data. This was deliberately relaxed to smooth the multi-repo transition.
  • [x] remove the presence test for kcov_tool from the drake-ci code. That should finish the removal of integration shims.

With #14294 merged now, all six flavors of nightly CI coverage builds ran for 1-3 hours each last night, and passed, and produced sensible results.

Tick box items are done. Closing.
