Drake: Some tests failing on Windows nightly build with BAD_COMMAND

Created on 4 May 2016 · 48 comments · Source: RobotLocomotion/drake

https://drake-jenkins.csail.mit.edu/job/windows-msvc-64-nightly-release/100/console

The last continuous build yesterday was clean, but this problem is also cropping up on experimental builds today.

Labels: continuous integration, high, kitware

All 48 comments

The windows-msvc-32-nightly-release job passed, strangely enough. The other three all failed. There were no CI configuration changes yesterday of which I am aware.

I'm not sure if this is the same problem, but I'm seeing "Not Run" instead of "BAD_COMMAND" in this log:
https://drake-jenkins.csail.mit.edu/job/experimental/1221/compiler=msvc-32,label=windows/consoleFull

They are "Not Run" because of "BAD_COMMAND".

Do we know the cause? If not, I'll reboot my laptop to get a local Windows development environment so I can try to reproduce it locally.

Not yet. Some builds that failed on the nightly passed on a later continuous, but not all. The problem is not obvious.

OK I shall reboot into Windows and try to reproduce / debug this ASAP.

BTW, this Win64 CI build contains both "Not Run" and "BAD_COMMAND" failures.

They are not run because there are link errors:

https://drake-cdash.csail.mit.edu/viewBuildError.php?buildid=15957

@billhoffman: Thanks. I was just able to replicate them on my local Windows machine. Will push fix to my branch soon.

As a general rule, it should be easier to tell what is going on by looking at CDash instead of the giant log files in Jenkins.

@amcastro-tri and I recently noticed that the CDash link from Drake's Jenkins website is no longer there.

For example, there's no CDash link in the page below:

[screenshot from 2016-05-04 14:34:55]

Did such a link exist in the past? If so, can it be restored?

It will be back soon. There is a link in the build history.

Thanks. Just to be clear, the build history is the giant console log file. Just open it and search for "CDash URL".
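
For what it's worth, the URLs can also be pulled out of the raw log without opening it in a browser. A minimal sketch, assuming the build's standard Jenkins consoleText endpoint is readable without credentials (the build URL is just the one from the top of this issue):

# Fetch the raw console log and print the CDash links it contains.
curl -fsSL https://drake-jenkins.csail.mit.edu/job/windows-msvc-64-nightly-release/100/consoleText \
  | grep "CDash"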

Also at the end of the console log:

12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CTest Result: SUCCESS BUT WITH BUILD WARNINGS
12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CDash Superbuild URL: https://drake-cdash.csail.mit.edu/index.php?project=drake-superbuild&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=61&value1=jenkins-linux-clang-experimental-2261-1225&field2=buildstarttime&compare2=84&value2=now
12:18:58   ------------------------------------------------------------------------------
12:18:58   *** CDash URL: https://drake-cdash.csail.mit.edu/index.php?project=Drake&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=61&value1=jenkins-linux-clang-experimental-2261-1225&field2=buildstarttime&compare2=84&value2=now
12:18:58   ------------------------------------------------------------------------------

To get to the build history, you go to the experimental jobs here: https://drake-jenkins.csail.mit.edu/job/experimental/. But do not use the history that you see at that level; there are currently no CDash links there. You have to click on the platform you want in the matrix build, like this: https://drake-jenkins.csail.mit.edu/job/experimental/compiler=msvc-32,label=windows/, and then the history will have CDash links.

Today I learned (TIL). Thanks!

I think we learned as well. The CDash links need to be more prominent on Jenkins, and I would like to get them back into the PR as well without adding too many bot comments.

OK, @liangfok's legitimate build breakage is explained, but spurious errors of this kind continue to crop up, so the issue remains open. Recent example:

https://drake-jenkins.csail.mit.edu/job/experimental/1228/

I suggest that someone try reverting #2239 and see if msvc-debug goes green (blue) again. Is there a way to do that from a branch without actually reverting on master?

Push a branch to this repo (cannot be done on a fork), log into Jenkins, go to one of these jobs:

Click "build with parameters" and enter the branch name.

To be clear, I am hoping that @david-german-tri or @jamiesnape can try this; I wasn't actually volunteering to debug it myself, just suggesting one possibly-easy win.

Yeah, we can also just hit the revert button on #2239 and see what happens. I'll do that now.

I'm all for testing the revert to see if that improves the situation. But perhaps a build server expert can help explain what BAD_COMMAND means? It seems to be a ctest error code. I feel like no one understands the issue yet, so even if the revert fixes it, what would be the next step?

If the revert fixes the problem, at least we've narrowed down the number of changes that caused the problem. We can then incrementally add the changes back to see when the problem returns.

@patmarion Yes, @jamiesnape is looking into it. Jamie, any news?

Still looking. Has anyone reproduced this locally?

I am back from travel and plan to look into it today. It seems that win64 is always getting this. I think it happened between these commits:

git diff 67702283ef949fc5b96ab2e33a7a0283a143ea30 c07fd6fec41e90a626abb7cc242679019865951a

I don't see anything that would cause this in the diff. I am going to try and reproduce locally on my laptop now.
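
If it does reproduce locally, one way to narrow that window further is git bisect. A sketch, assuming the first commit in the diff above is the last known-clean one and that some local command reproduces the failure at each step:

# Bisect between the two commits quoted above.
git bisect start
git bisect bad  c07fd6fec41e90a626abb7cc242679019865951a   # first commit suspected bad
git bisect good 67702283ef949fc5b96ab2e33a7a0283a143ea30   # last commit believed clean
# At each step: rebuild, run whatever reproduces the failure, then mark
# the result with "git bisect good" or "git bisect bad". When done:
git bisect reset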

@jamiesnape Nope. I presume that whatever windows-msvc-64-continuous-debug is doing is the best way to repro, since it's been consistently broken. I don't think repro should try to use -release, because those don't consistently fail.

@jamiesnape Not yet, I agree that is the next logical step

@patmarion I should add that I do not believe #2239 is the culprit. For one thing, the failure was pre-existing on win64 at the time we merged #2239. But we'll have CI results soon. :-)

@david-german-tri FWIW, I don't think CI passing on the PR revert will tell us anything. (I guess CI _failing_ on the revert PR would mean that #2239 is unlikely to be at fault.) PRs don't have a -debug configuration, which is the only one that is reliably failing. Only after merging #2288 to master and watching Continuous builds go green would we know that #2239 is a sufficient explanation for the regression.

@jwnimmer-tri You are of course right. (Unhelpfully, the CI passed.) @jamiesnape's instructions don't work either, for the same reason, and there doesn't appear to be a way to "build with parameters" on https://drake-jenkins.csail.mit.edu/view/Windows/job/windows-msvc-32-continuous-debug/

OK, I added a branch parameter to the continuous build and kicked off https://drake-jenkins.csail.mit.edu/view/Windows/job/windows-msvc-32-continuous-debug/129/

I should add that I do not believe #2239 is the culprit. For one thing, the failure was pre-existing on win64 at the time we merged #2239.

Indeed. For the record, I believe the first instance of BAD_COMMAND breaking was msvc-64-continuous build 182 on May 3, 2016, 12:53 PM.

Build 182 (the first one to ever show this problem) combined both PR #2243 and PR #2238 into a single build. I am not sure how either of those two PRs would explain the test results (since build 110 and 111 of windows-msvc-64-continuous-debug corresponded to those PRs, and were fine), but FYI to the crowd.

It is a tough one to track down. I just did an experimental Win64 release build, which had shown the issue twice for me. But this time I remote desktopped into the machine, and of course it passed! Then I remoted into a running PR build and it failed, but it also cleaned up the build after it was done, so I could not examine the files. I have not been able to reproduce this locally, even running the same script that CI uses.

Oh goodie. Build system race conditions.

For debug builds, the common thread is that c4.4xlarge is passing and c4.8xlarge is failing. Release builds apparently don't always hit the race, though c4.4xlarge never fails, and c4.8xlarge sometimes fails.

I thought there were no CI configuration edits thought to be in play?

To put Jenkins back into good health, perhaps we can switch msvc back to c4.4xlarge for now, while the build system error is root caused?

Or, _shrug_, we don't really need it done for the weekend; maybe we all just get a fresh set of eyes on it first thing next week.

@david-german-tri I said the experimental job. The continuous job would not have parameters as it necessarily always runs on master.

I am going to take a look at this today. You are correct: all the issues showed up with c4.8xlarge. That should help track this down. I am going to create a c4.8xlarge system and run from the command line, and hopefully it reproduces. You can see the switch to 8xlarge causing the failure on CDash here:
https://drake-cdash.csail.mit.edu/index.php?project=Drake&showfilters=1&filtercount=2&showfilters=1&filtercombine=and&field1=label&compare1=63&value1=jenkins-windows-msvc-32-nightly-debug&field2=buildstarttime&compare2=84&value2=now

I thought there were no CI configuration edits thought to be in play?

True. It should only be MATLAB builds running on c4.8xlarge. I will investigate.

@billhoffman Great, thanks!

My only other thought was whether there is a timeout on tests, where launching 36 unit tests at once somehow saturates the network or otherwise causes them to be too slow to run.

That's a shot in the dark, but FYI in case it helps you during debugging. Thanks for digging into this.
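
If the timeout theory is worth a quick experiment, capping test parallelism and raising the per-test timeout on the c4.8xlarge instance would be a cheap check. A sketch using standard ctest options; the numbers are only illustrative:

# Run the suite with fewer concurrent tests than the instance has vCPUs,
# and with a generous per-test timeout, to see whether the Not Run /
# BAD_COMMAND failures go away.
ctest -j 8 --timeout 600 --output-on-failure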

It looks like the windows and windows_matlab labels are erroneously both being added to the c4.8xlarge instances. The trouble is that we will need those instance types for MATLAB builds anyway, so fixing that and running on c4.4xlarge would only be a temporary fix.

@jamiesnape

The continuous job would not have parameters as it necessarily always runs on master.

Yes, understood, but the error occurs more reliably on debug builds, which are not part of the experimental job. So I parameterized the msvc-32-continuous-debug build by branch, ran it on the rollback PR for #2239, and it still failed.

All the -experimental builds are.

I see the issue. You cannot see the debug parameter.

I see the issue. You cannot see the debug parameter.

To clarify, I had omitted to add the parameter to that build, not that it was hidden per se.

