[node=windows-x86_64,toolchain=msvc] FAIL: //src/test/java/com/google/devtools/build/lib/rules/android:AndroidBinaryTest (shard 4 of 5) (see C:/windows/temp/_bazel_system/7bqgevoe/execroot/io_bazel/bazel-out/msvc_x64-fastbuild/testlogs/src/test/java/com/google/devtools/build/lib/rules/android/AndroidBinaryTest/shard_4_of_5/test.log).
[node=windows-x86_64,toolchain=msvc] INFO: From Testing //src/test/java/com/google/devtools/build/lib/rules/android:AndroidBinaryTest (shard 4 of 5):
[node=windows-x86_64,toolchain=msvc] ==================== Test output for //src/test/java/com/google/devtools/build/lib/rules/android:AndroidBinaryTest (shard 4 of 5):
[node=windows-x86_64,toolchain=msvc] java.lang.NullPointerException
[node=windows-x86_64,toolchain=msvc] at java.util.jar.Attributes.read(Attributes.java:394)
[node=windows-x86_64,toolchain=msvc] at java.util.jar.Manifest.read(Manifest.java:199)
[node=windows-x86_64,toolchain=msvc] at java.util.jar.Manifest.<init>(Manifest.java:69)
[node=windows-x86_64,toolchain=msvc] at sun.tools.jar.Main.run(Main.java:176)
[node=windows-x86_64,toolchain=msvc] at sun.tools.jar.Main.main(Main.java:1288)
[node=windows-x86_64,toolchain=msvc] C:/Windows/Temp/_bazel_SYSTEM/7bqGEvoe/execroot/io_bazel/bazel-out/msvc_x64-fastbuild/bin/src/test/java/com/google/devtools/build/lib/rules/android/AndroidBinaryTest: ERROR: /c/Windows/Temp/_bazel_SYSTEM/7bqGEvoe/execroot/io_bazel/bazel-out/msvc_x64-fastbuild/bin/src/test/java/com/google/devtools/build/lib/rules/android/AndroidBinaryTest failed because C:/windows/temp/_bazel_system/7bqgevoe/external/local_jdk/bin/jar.exe failed
Jenkins:
http://ci.bazel.io/job/bazel-tests/1024/testReport/src_test_java_com_google_devtools_build_lib_rules_android_AndroidBinaryTest/cmd_shard_4_5/src_test_java_com_google_devtools_build_lib_rules_android_AndroidBinaryTest_cmd_shard_4_5/
http://ci.bazel.io/job/bazel-tests/1024/consoleFull
FYI @laszlocsomor @dslomov
Passes on my machine at commit 98c831ec11d8bea47bcafce69a22df716381da5a with Bazel 0.5.4.
Let me investigate further.
C:\work\bazel>bazel test //src/test/java/com/google/devtools/build/lib/rules/android:AndroidBinaryTest
(...)
INFO: Elapsed time: 273.722s, Critical Path: 111.54s
//src/test/java/com/google/devtools/build/lib/rules/android:AndroidBinaryTest PASSED in 32.1s
From http://ci.bazel.io/blue/organizations/jenkins/bazel-tests/detail/bazel-tests/1024/pipeline/31 I found out that the workspace was c:\jenkins\workspace\bazel-tests-node=windows-x86_64,toolchain=msvc.
I managed to RDP into the machine. In an administrator MSYS shell (non-elevated didn't work, I suspect permission issues), I found out that the jar.exe in the failed invocation is different from the one in a recent run.
laszlocsomor@windows-with-visual-studio MSYS ~
$ realpath /c/windows/temp/_bazel_system/5HtUcGP4/external/local_jdk/bin/jar.exe
/c/Program Files/java/jdk1.8.0_144/bin/jar.exe
laszlocsomor@windows-with-visual-studio MSYS ~
$ realpath C:/windows/temp/_bazel_system/7bqgevoe/external/local_jdk/bin/jar.exe
/c/windows/temp/_bazel_SYSTEM/install/f8e4afb1dfae5304ee3ab8d76fb7ecaf/_embedded_binaries/embedded_tools/jdk/bin/jar.exe
laszlocsomor@windows-with-visual-studio MSYS ~
$ ls -la /c/windows/temp/_bazel_system/5HtUcGP4/external/local_jdk/bin/jar.exe
-rwxr-xr-x 1 laszlocsomor None 15904 Aug 28 15:58 /c/windows/temp/_bazel_system/5HtUcGP4/external/local_jdk/bin/jar.exe
laszlocsomor@windows-with-visual-studio MSYS ~
$ ls -la C:/windows/temp/_bazel_system/7bqgevoe/external/local_jdk/bin/jar.exe
-rwxr-xr-x 1 laszlocsomor None 16384 Jan 1 2027 C:/windows/temp/_bazel_system/7bqgevoe/external/local_jdk/bin/jar.exe
What we see is that the failed run used an embedded JDK, while the earlier, successful(?) one used an external JDK.
Bazel was c:\bazel_ci\installs\latest\bazel.exe and it is indeed the 0.5.4 release, its MD5 matches that of the Bazel binary on my machine:
laszlocsomor@windows-with-visual-studio MSYS ~
$ C:/bazel_ci/installs/latest/bazel.exe --batch version
Build label: 0.5.4
Build target: bazel-out/msvc_x64-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Aug 25 09:59:45 2017 (1503655185)
Build timestamp: 1503655185
Build timestamp as int: 1503655185
Extracting Bazel installation...
laszlocsomor@windows-with-visual-studio MSYS ~
$ md5sum C:/bazel_ci/installs/latest/bazel.exe
234135feee0f7f696545b31cc3fd6a8b *C:/bazel_ci/installs/latest/bazel.exe
laszlocsomor@windows-with-visual-studio MSYS ~
$ du -h C:/bazel_ci/installs/latest/bazel.exe
192M C:/bazel_ci/installs/latest/bazel.exe
Even more interestingly, I can run the old jar.exe from a non-elevated shell, but I can't run the new one.
The new (failing) jar.exe has the same md5sum as that on my machine though so the binary doesn't seem to be corrupt.
Interestingly, after I located the file in Windows Explorer (I had to elevate access), looked at its properties and permissions, and copied that JDK directory elsewhere, I can now run the original file (C:\windows\temp\_bazel_system\7bqgevoe\external\local_jdk\bin\jar.exe).
I don't think I did anything else with this file. I have seen something similar in the past, where merely looking at a file's permissions caused them to be updated. In any case, something fishy is going on.
My guess is that the temp directory being "c:\windows\temp" might have played a role, but I have no evidence to back that up. I simply have no other clue.
That said, I updated the system envvars on the machine so TMP=TEMP=C:\temp now. Let's see if this test failure was a flake or if it happens again.
Thank you!
Apparently this is not a flake, @ulfjack just reported this internally too.
I cannot work on this issue right now.
@philwo, have you seen it on CI since yesterday?
@philwo: Are you sure P0 is adequate? Is this issue breaking anything but CI?
@laszlocsomor It's not my choice, it's what the Bazel Sheriff Duties doc says :|
Monitor ci.bazel.io: if a job fails, find the causing CL and:
notify the original author of the change:
P1 bug on Github assigned to the author with the label "breakage".
set priority to P0 if it makes bazel-tests job red.
I think it's my job to find the culprit CL and in theory roll it back, but as you said, the repro case is not clear.
I see P0 as "let's work on this ASAP", where "ASAP" literally means as soon as possible and if not possible (because you currently have even more important things to do or aren't able, because OOO or sick or any other reason) then that's the way it is. The world won't end - it's a CI breaking bug, not a global outage of a production service.
Maybe reassign to someone else on the Windows team who can take it?
I'll do the daily GitHub push now, let's see if it happens again during that CI run.
Thanks!
@dslomov , @meteorcloudy : FYI
I can repro this issue locally, I'll take a look.
Thank you!
Looks like sharding the AndroidBinaryTest somehow makes it fail on Windows.
I reverted https://github.com/bazelbuild/bazel/commit/e92e3d8a4fcab49cda1a2bebc02d5119798d5258 and it's passing.
From the error message:
java.lang.NullPointerException
at java.util.jar.Attributes.read(Attributes.java:394)
at java.util.jar.Manifest.read(Manifest.java:199)
at java.util.jar.Manifest.<init>(Manifest.java:69)
at sun.tools.jar.Main.run(Main.java:176)
at sun.tools.jar.Main.main(Main.java:1288)
C:/Windows/Temp/_bazel_SYSTEM/7bqGEvoe/execroot/io_bazel/bazel-out/msvc_x64-fastbuild/bin/src/test/java/com/google/devtools/build/lib/rules/android/AndroidBinaryTest: ERROR: /c/Windows/Temp/_bazel_SYSTEM/7bqGEvoe/execroot/io_bazel/bazel-out/msvc_x64-fastbuild/bin/src/test/java/com/google/devtools/build/lib/rules/android/AndroidBinaryTest failed because C:/windows/temp/_bazel_system/7bqgevoe/external/local_jdk/bin/jar.exe failed
I guess it might be because, when we run multiple AndroidBinaryTest shards at the same time, all of them try to create the same classpath jar, which leads to a conflict.
That sounds like a plausible reason! How does the launcher compute the location of the classpath jar? Is it next to the launcher, derived from the launcher's name?
So it's not the jar.exe itself that's corrupt or has bad permissions; jar.exe runs fine and just cannot clobber the output because concurrent jar.exe's are trying to write the same file.
This is a scenario where having a $TMP/$TEMP envvar would be very useful.
That's right! The launcher (both the bash and the exe launcher) computes the classpath jar path next to the launcher, with a -classpath suffix. Sounds like a good idea to generate the classpath jar in a TMP directory.
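To illustrate the collision, here is a minimal Python sketch. It is not the launcher's actual code: the function name `classpath_jar_path`, the shortened launcher path, and the exact `-classpath.jar` file name are assumptions made for illustration. The only point is that a path derived solely from the launcher's location is identical for every shard.

```python
import os


def classpath_jar_path(launcher_path):
    """Hypothetical helper: derive the classpath jar path from the launcher path alone."""
    base, _ext = os.path.splitext(launcher_path)
    return base + "-classpath.jar"


# Every shard of the same test runs the same launcher binary
# (the path below is a shortened, made-up stand-in).
launcher = "path/to/AndroidBinaryTest.exe"
shard_paths = [classpath_jar_path(launcher) for _shard in range(5)]

# All five shards compute one and the same output file, so concurrent
# writers can clobber each other.
print(len(set(shard_paths)))
```

With nothing shard-specific in the path, whichever shard writes last wins, and any shard reading the jar in the meantime may see a partially written file.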
Haha, I was bitten by the same problem a couple months ago while I was hacking on building Android rules, and added code to the launcher batch script to write the args to a temp file.
Sounds like a good idea to generate classpath jar in a TMP directory.
Yes, and I recommend creating a temp directory within the $TMP directory, because $TMP may be shared between actions.
Yes! We should introduce some randomness to the classpath jar file path. I'll do the refactoring.
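A rough sketch of that idea in Python, assuming nothing about how the real launcher will end up implementing it: the helper name `make_classpath_jar_path` and the jar file name are hypothetical. Each invocation creates its own uniquely named directory under the system temp location and places the classpath jar inside it, so concurrent shards never share an output path.

```python
import os
import shutil
import tempfile


def make_classpath_jar_path(test_name):
    """Hypothetical helper: return (tmp_dir, jar_path), with jar_path inside a fresh private directory."""
    # mkdtemp creates a uniquely named directory under the default temp
    # location, which Python resolves from TMPDIR, TEMP, or TMP.
    tmp_dir = tempfile.mkdtemp(prefix=test_name + "-")
    return tmp_dir, os.path.join(tmp_dir, test_name + "-classpath.jar")


# Simulate five shards, each asking for a classpath jar location.
dirs_and_jars = [make_classpath_jar_path("AndroidBinaryTest") for _shard in range(5)]
jar_paths = [jar for _tmp_dir, jar in dirs_and_jars]

# Each shard gets a distinct path.
print(len(set(jar_paths)))

# Clean up the private directories once the test process is done.
for tmp_dir, _jar in dirs_and_jars:
    shutil.rmtree(tmp_dir)
```

Using a private subdirectory rather than a random file name directly in $TMP follows the recommendation above, since $TMP itself may be shared between actions, and it makes cleanup a single directory removal.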