Bazel: Bazel CI: Android Testing is failing on Windows with Bazel@HEAD

Created on 5 Dec 2018  路  22Comments  路  Source: bazelbuild/bazel

https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#8425be19-6d12-4ec4-934b-228d76be474c

ERROR: D:/b/ny24af4c/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
聽 | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Culprit Finder says d235a06ede2247479dd8e5c315d55df4d657a618 is the culprt

@laszlocsomor

P1 breakage bug

Most helpful comment

I believe I have understand the underlying problem. cbacbb9 actually revealed a header checking bug when using remote execution.
Before cbacbb9, we filter the original output during the actual execution of the compiling action, which means the store action output is already filtered. In later builds, when we reply the output from the stored file due to remote cache hit, the output is already filtered, thus we capture no header file dependencies anymore.

After cbacbb9, we store the output to file first, so no matter remote cache is hit or not, we always parse the original output. Because the output could contain absolute path (execroot) from a different machine, we got this problem.

I'm working on a fix to remote the execroot prefilx from include file paths in the output. We'll see if that works.

All 22 comments

FYI @buchgr

@meteorcloudy , how do i find the commit that Bazel was built at? I cannot repro this locally with Bazel built at f2b3bec088cd24062df07056a7b7061ced812b09.

Also, the output root on CI is under C:/windows/system32/config/systemprofile -- that doesn't look good.

Never mind, found it in the "Build Bazel (Windows)" task on the same page!

Great, we explicitly set the output user root to D:/b, so it shouldn't be C:/windows/system32/config/systemprofile
https://github.com/bazelbuild/continuous-integration/blob/master/buildkite/bazelci.py#L856

Oh, you're right! I was looking at the "Bazel Info" part of the log. That does not use the --output_user_root flag.

I see, but it's better we use --output_user_root flag also for bazel info. I'll send a change.

I tried to rerun the job in culprit finder
https://buildkite.com/bazel/culprit-finder/builds/41#b24d1754-a989-47f4-86e9-0345e771560a
This time it says everything is passing..

Looks like it's a flaky error, d235a06 is certainly not the culprit, sorry for the noise here.

No worries! Thanks for looking again!

Are P1 and Team-Windows adequate?

Still need to look into what caused it, it can be stably reproduced in downstream
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/649#1064573d-b8f3-4b7a-be4c-cf6467da4312

As a side-note, it would help repro-ing if some step (maybe "Bazel Info") printed the client environment. Is that possible?

Yes, that's doable!

Similar failure is happening to TensorFlow https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#0910f88b-7b15-49d8-97c6-6d73f3aa45d6

Let me take a look!

I noticed something in https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/679#71068d7e-b147-46c6-a7ea-dcf6ddb97425 that I didn't before:

ERROR: D:/b/fkulpsov/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/file-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/file-jni.cc':
聽 | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

The error comes from D:/b/fkulpsov/... but it complains about an inclusion from D:/b/hh7flgnl/.... I have no idea what hh7flgnl is or where it comes from -- the output base (according to the "Bazel Info" step) is D:/b/fkulpsov.

Hmmm, a wild guess: could this be a poisoned cache entry?

Header validation on Windows works by asking the compiler to write the list of included files (via the /showIncludes flag), then validating the list against the allowed list of files. The VC++ compiler however outputs absolute paths. and Bazel strips known prefixes from these paths (e.g. the execution root) to get relative paths.

I haven't yet figured out how, but maybe the compiler's output from an earlier build (with a different output root) was somehow written to file in the output cache, and incorrectly retrieved for header validation?

A cache issue would also explain why this issue is flaky, as pointed out by @meteorcloudy here: https://github.com/bazelbuild/bazel/issues/6847#issuecomment-444486473

More evidence for the cache poisoning:

In the previous build of "Android Testing" on Windows (678), the build failed with a similar error (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/678#515c01ce-c6a1-4f9e-989f-ec4aef899b24):

ERROR: D:/b/ncxufbua/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
聽 | 'D:/b/hh7flgnl/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

output base was D:/b/ncxufbua, and the missing inclusion was again from D:/b/hh7flgnl.

In the previous build before that (677) (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/677#9a4f231f-c38a-4b6a-a691-38be5ec7768b), "Android Testing" on Windows built successfully. Guess what the output base was? It was D:/b/hh7flgnl.

My theory is that build 677 inserted a cache entry that builds 678 and 679 incorrectly retrieved, and failed header validation because they tried stripping the wrong execroot prefix.

@lberki told me it's quite possible that Bazel stores the action's stdout in the remote cache, because that's how Bazel can replay the action's output if it was cached.

The latest "Android Testing" build as of now is 689 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/689#56cc7a37-3397-4d8a-b06b-ea7bb02980fe) and it's failing just like the others, only with different paths:

ERROR: D:/b/wanmbwio/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/processes-jni.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/processes-jni.cc':
聽 | 'D:/b/tggxeo2z/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

This can be traced back to build 682 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/682#98aa10ac-cbc6-43de-9aea-dd9cda8ce0a4) that also failed (which I find strange), because how did it insert a cache entry then? Or maybe the output root D:/b/tggxeo2z was reused in this build and was already used in an earlier successful build that inserted this cache entry?

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
聽 | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 681 also used D:/b/tggxeo2z output root, and failed with inclusion ostensibly from D:/b/ny24af4c:

ERROR: D:/b/tggxeo2z/external/bazel_tools/src/main/native/windows/BUILD:42:1: Couldn't build file external/bazel_tools/src/main/native/windows/_objs/windows_jni.dll/jni-util.obj: undeclared inclusion(s) in rule '@bazel_tools//src/main/native/windows:windows_jni.dll':
--
聽 | this rule is missing dependency declarations for the following files included by 'external/bazel_tools/src/main/native/windows/jni-util.cc':
聽 | 'D:/b/ny24af4c/execroot/__main__/bazel-out/host/genfiles/external/bazel_tools/src/main/native/jni_md.h'

Build 680 (https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/680#905b6889-1c55-43a4-af02-22d19f9b4e33) used the D:/b/ny24af4c output root and failed with undeclared inclusion from D:/b/hh7flgnl that we already know from builds 678 and 679.

@buchgr , @meteorcloudy :
Do you know:

  • whether my theory is plausible, i.e. whether Bazel really stores the compilation action's outErr in the remote cache?
  • how we could postprocess the action's outErr to remove the execroot prefix, so the cached file will only have relative paths?
  • if it's possible to purge bad entries from the cache, or purge the whole cache that CI uses?

@laszlocsomor Thanks for looking into this! Your conclusion is very plausible, I suspect it's caused by https://github.com/bazelbuild/bazel/commit/cbacbb9083efab215c1c8015f7488cf8b2747889.

@ulfjack confirmed we do cache the action's outErr in remote cache. But we haven't understand why it's not happening before.

I believe I have understand the underlying problem. cbacbb9 actually revealed a header checking bug when using remote execution.
Before cbacbb9, we filter the original output during the actual execution of the compiling action, which means the store action output is already filtered. In later builds, when we reply the output from the stored file due to remote cache hit, the output is already filtered, thus we capture no header file dependencies anymore.

After cbacbb9, we store the output to file first, so no matter remote cache is hit or not, we always parse the original output. Because the output could contain absolute path (execroot) from a different machine, we got this problem.

I'm working on a fix to remote the execroot prefilx from include file paths in the output. We'll see if that works.

After some investigation, I think filtering the original output before storing it to file is not possible without applying a filter to the FileOutErr class, but that's incompatible with some SpawnRunner implementations. That's why @ulfjack authorized cbacbb9. So I had to go with a suboptimal solution, please see #6931 .

Was this page helpful?
0 / 5 - 0 ratings