Drake: Some tests failing on Windows nightly build with BAD_COMMAND

Created on 4 May 2016 · 48 comments · Source: RobotLocomotion/drake

https://drake-jenkins.csail.mit.edu/job/windows-msvc-64-nightly-release/100/console

The last continuous build yesterday was clean, but this problem is also cropping up on experimental builds today.

Labels: continuous integration, high, kitware

All 48 comments

The windows-msvc-32-nightly-release job passed, strangely enough. The other three all failed. There were no CI configuration changes yesterday of which I am aware.

I'm not sure if this is the same problem, but I'm seeing "Not Run" instead of "BAD_COMMAND" in this log:
https://drake-jenkins.csail.mit.edu/job/experimental/1221/compiler=msvc-32,label=windows/consoleFull

They are "Not Run" because of "BAD_COMMAND".

Do we know the cause? If not, I'll reboot my laptop to get a local Windows development environment so I can try to reproduce it locally.

Not yet. Some builds that failed on the nightly passed on a later continuous, but not all. The problem is not obvious.

OK I shall reboot into Windows and try to reproduce / debug this ASAP.

BTW, this Win64 CI build contains both "Not Run" and "BAD_COMMAND" failures.

They are not run because there are link errors:

https://drake-cdash.csail.mit.edu/viewBuildError.php?buildid=15957

@billhoffman: Thanks. I was just able to replicate them on my local Windows machine. Will push fix to my branch soon.

As a general rule, it should be easier to tell what is going on by looking at CDash instead of the giant log files in Jenkins.

@amcastro-tri and I recently noticed that the CDash link from Drake's Jenkins website is no longer there.

For example, there's no CDash link in the page below:

[screenshot from 2016-05-04 14:34:55]

Did such a link exist in the past? If so, can it be restored?

It will be back soon. There is a link in the build history.

Thanks. Just to be clear, the build history is the giant console log file. Just open it and search for "CDash URL".
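
For what it's worth, the URLs can also be pulled out of the raw log without opening it in a browser. A minimal sketch, assuming the build's standard Jenkins consoleText endpoint is readable without credentials (the build URL is just the one from the top of this issue):

# Fetch the raw console log and print the CDash links it contains.
curl -fsSL https://drake-jenkins.csail.mit.edu/job/windows-msvc-64-nightly-release/100/consoleText \
  | grep "CDash"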

Also at the end of the console log:

12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CTest Result: SUCCESS BUT WITH BUILD WARNINGS
12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CDash Superbuild URL: https://drake-cdash.csail.mit.edu/index.php?project=drake-superbuild&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=61&value1=jenkins-linux-clang-experimental-2261-1225&field2=buildstarttime&compare2=84&value2=now
12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CDash URL: https://drake-cdash.csail.mit.edu/index.php?project=Drake&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=61&value1=jenkins-linux-clang-experimental-2261-1225&field2=buildstarttime&compare2=84&value2=now
12:18:58   ------------------------------------------------------------------------------

To get to the build history, you go to the experimental jobs here: https://drake-jenkins.csail.mit.edu/job/experimental/. But do not use the history that you see at that level; there are currently no CDash links there. You have to click on the platform you want in the matrix build, like this: https://drake-jenkins.csail.mit.edu/job/experimental/compiler=msvc-32,label=windows/, and then the history will have CDash links.

Today I learned (TIL). Thanks!

I think we learned as well. The CDash links need to be more prominent on Jenkins, and I would like to get them back into the PR as well without adding too many bot comments.

OK, @liangfok's legitimate build breakage is explained, but spurious errors of this kind continue to crop up, so the issue remains open. Recent example:

https://drake-jenkins.csail.mit.edu/job/experimental/1228/

I suggest that someone try reverting #2239 and see if msvc-debug goes green (blue) again. Is there a way to do that from a branch without actually reverting on master?

Push a branch to this repo (cannot be done on a fork), log into Jenkins, go to one of these jobs:

Click "build with parameters" and enter the branch name.

To be clear, I am hoping that @david-german-tri or @jamiesnape can try this; I wasn't actually volunteering to debug it myself, just suggesting one possibly-easy win.

Yeah, we can also just hit the revert button on #2239 and see what happens. I'll do that now.

I'm all for testing the revert to see if that improves the situation. But perhaps a build server expert can help explain what BAD_COMMAND means? It seems to be a ctest error code. I feel like no one understands the issue yet, so even if the revert fixes it, what would be the next step?

If the revert fixes the problem, at least we've narrowed down the number of changes that caused the problem. We can then incrementally add the changes back to see when the problem returns.

@patmarion Yes, @jamiesnape is looking into it. Jamie, any news?

Still looking. Has anyone reproduced this locally?

I am back from travel and plan to look into it today. It seems that win64 is always getting this. I think it happened between these commits:

git diff 67702283ef949fc5b96ab2e33a7a0283a143ea30 c07fd6fec41e90a626abb7cc242679019865951a

I don't see anything that would cause this in the diff. I am going to try and reproduce locally on my laptop now.
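
If it does reproduce locally, one way to narrow that window further is git bisect. A sketch, assuming the first commit in the diff above is the last known-clean one and that some local command reproduces the failure at each step:

# Bisect between the two commits quoted above.
git bisect start
git bisect bad  c07fd6fec41e90a626abb7cc242679019865951a   # first commit suspected bad
git bisect good 67702283ef949fc5b96ab2e33a7a0283a143ea30   # last commit believed clean
# At each step: rebuild, run whatever reproduces the failure, then mark
# the result with "git bisect good" or "git bisect bad". When done:
git bisect reset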

@jamiesnape Nope. I presume that whatever windows-msvc-64-continuous-debug is doing is the best way to repro, since it's been consistently broken. I don't think repro should try to use -release, because those don't consistently fail.

@jamiesnape Not yet, I agree that is the next logical step

@patmarion I should add that I do not believe #2239 is the culprit. For one thing, the failure was pre-existing on win64 at the time we merged #2239. But we'll have CI results soon. :-)

@david-german-tri FWIW, I don't think CI passing on the PR revert will tell us anything. (I guess CI _failing_ on the revert PR would mean that #2239 is unlikely to be at fault.) PRs don't have a -debug configuration, which is the only one that is reliably failing. Only after merging #2288 to master and watching Continuous builds go green would we know that #2239 is a sufficient explanation for the regression.

@jwnimmer-tri You are of course right. (Unhelpfully, the CI passed.) @jamiesnape's instructions don't work either, for the same reason, and there doesn't appear to be a way to "build with parameters" on https://drake-jenkins.csail.mit.edu/view/Windows/job/windows-msvc-32-continuous-debug/

OK, I added a branch parameter to the continuous build and kicked off https://drake-jenkins.csail.mit.edu/view/Windows/job/windows-msvc-32-continuous-debug/129/

I should add that I do not believe #2239 is the culprit. For one thing, the failure was pre-existing on win64 at the time we merged #2239.

Indeed. For the record, I believe the first instance of BAD_COMMAND breaking was msvc-64-continuous build 182 on May 3, 2016, 12:53 PM.

Build 182 (the first one to ever show this problem) combined both PR #2243 and PR #2238 into a single build. I am not sure how either of those two PRs would explain the test results (since build 110 and 111 of windows-msvc-64-continuous-debug corresponded to those PRs, and were fine), but FYI to the crowd.

It is a tough one to track down. I just did an experimental Win64 release build, which had shown the issue twice for me. But this time I remote desktopped into the machine, and of course it passed! Then I remoted into a running PR build and it failed, but it also cleaned up the build after it was done, so I could not examine the files. I have not been able to reproduce this locally, even running the same script that CI uses.

Oh goodie. Build system race conditions.

For debug builds, the common thread is that c4.4xlarge is passing and c4.8xlarge is failing. Release builds apparently don't always hit the race, though c4.4xlarge never fails, and c4.8xlarge sometimes fails.

I thought there were no CI configuration edits thought to be in play?

To put Jenkins back into good health, perhaps we can switch msvc back to c4.4xlarge for now, while the build system error is root caused?

Or, _shrug_, we don't really need it done for the weekend; maybe we all just get a fresh set of eyes on it first thing next week.

@david-german-tri I said the experimental job. The continuous job would not have parameters as it necessarily always runs on master.

I am going to take a look at this today. You are correct: all the issues showed up with c4.8xlarge. That should help track this down. I am going to create a c4.8xlarge system and run from the command line, and hopefully it reproduces. You can see the switch to 8xlarge causing the failure on CDash here:
https://drake-cdash.csail.mit.edu/index.php?project=Drake&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=63&value1=jenkins-windows-msvc-32-nightly-debug&field2=buildstarttime&compare2=84&value2=now

I thought there were no CI configuration edits thought to be in play?

True. It should only be MATLAB builds running on c4.8xlarge. I will investigate.

@billhoffman Great, thanks!

My only other thought was whether there is a timeout on tests, where launching 36 unit tests at once somehow saturates the network or otherwise causes them to be too slow to run.

That's a shot in the dark, but FYI in case it helps you during debugging. Thanks for digging into this.
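
If the timeout theory is worth a quick experiment, capping test parallelism and raising the per-test timeout on the c4.8xlarge instance would be a cheap check. A sketch using standard ctest options; the numbers are only illustrative:

# Run the suite with fewer concurrent tests than the instance has vCPUs,
# and with a generous per-test timeout, to see whether the Not Run /
# BAD_COMMAND failures go away.
ctest -j 8 --timeout 600 --output-on-failure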

It looks like the windows and windows_matlab labels are erroneously both being added to the c4.8xlarge instances. The trouble is that we will need those instance types for MATLAB builds anyway, so fixing that and running on c4.4xlarge would only be a temporary fix.

@jamiesnape

The continuous job would not have parameters as it necessarily always runs on master.

Yes, understood, but the error occurs more reliably on debug builds, which are not part of the experimental job. So I parameterized the msvc-32-continuous-debug build by branch, ran it on the rollback PR for #2239, and it still failed.

All the -experimental builds are.

I see the issue. You cannot see the debug parameter.

I see the issue. You cannot see the debug parameter.

To clarify, I had omitted to add the parameter to that build, not that it was hidden per se.

