Bazel: Bazel 0.21.0 flipped --incompatible_strict_action_env to true and broke itself

Created on 2 Jan 2019 · 55Comments · Source: bazelbuild/bazel

Description of the problem / feature request:

Bazel 0.21.0 flipped --incompatible_strict_action_env to true via https://github.com/bazelbuild/bazel/issues/6648, which broke CI on macOS and Windows CI. On macOS we can no longer find "md5" (which is in /sbin) and on Windows we can no longer find Python, PowerShell and other tools.

We didn't discover these breakages during testing, because apparently we don't actually test Bazel itself in the "Bazel@HEAD + Downstream Projects".

We'll revert this via a flag in the CI config files, but flag owners should work on fixing our tests (and taking user feedback into account whether this new behavior is actually working well for others) so that we can eventually remove the config entry again.

Relevant logs:

Shell tests on macOS can't find "md5": https://storage.googleapis.com/bazel-buildkite-artifacts/08029882-a54b-44b4-bb8c-01d3fca82488/src/test/shell/shell_utils_test/attempt_1.log
Tests on Windows can't find Python: https://storage.googleapis.com/bazel-buildkite-artifacts/31d11339-45e6-4efa-8eb1-b61f7db6eba8/src%5Ctest%5Cjava%5Ccom%5Cgoogle%5Cdevtools%5Cbuild%5Clib%5Cblackbox%5Ctests%5CPythonBlackBoxTest%5Cattempt_1.log
Tests on Windows can't find Git: https://storage.googleapis.com/bazel-buildkite-artifacts/31d11339-45e6-4efa-8eb1-b61f7db6eba8/src%5Ctest%5Cshell%5Cbazel%5Cgit_repository_test%5Cattempt_1.log
Tests on Windows can't find Powershell: https://storage.googleapis.com/bazel-buildkite-artifacts/31d11339-45e6-4efa-8eb1-b61f7db6eba8/src%5Ctest%5Cjava%5Ccom%5Cgoogle%5Cdevtools%5Cbuild%5Clib%5Cblackbox%5Cframework%5Cframework_tests%5Cattempt_1.log

P2 team-Remote-Exec process

Source

philwo

👍2

All 55 comments

Philipp, thanks for finding, filing, and working around this bug.

We (@philwo , @hlopko , and myself) argued about this flag today and realized that, while sh_* rules should probably not run python directly (why would a shell script need to do that?), they should be able to run their py_binary data-dependencies, which rightfully expect to have python / python.exe on the PATH.

It's unclear to me what should Bazel do. It seems that PATH's whole purpose is to contain the locations of these common tools, in which case PATH should be exempt from the "strict" env and be available to the action. On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.

Another important observation @philwo made is that the "strict" action env is just as non-hermetic as the non-strict one; we just replace the clientenv's PATH with a hardcoded PATH string, but whatever is in those directories is out of our control (and not tracked by Bazel). And on one machine they might have python (and point to python2) but on another they might have no python at all or have python pointing to python3.

laszlocsomor on 2 Jan 2019

👍1

One option would be to make PATH work more like TMPDIR, where it's not part of the action key and injected by the spawn runner at the last moment. That would be a step towards fixing #6392, too.

benjaminp on 2 Jan 2019

To prevent this from happening again, I filed https://github.com/bazelbuild/continuous-integration/pull/434

meteorcloudy on 2 Jan 2019

they should be able to run their py_binary data-dependencies, which rightfully expect to have python / python.exe on the PATH. [...] On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.

I can't speak for PATH in general, but from the point of view of the Python interpreter: The long-term solution is to define the interpreter as a toolchain rule, where a particular toolchain target is appropriate for a specific system / path installation. The toolchain target for the current host system could be generated by a repository rule.

Currently py_runtime is the closest thing we have to a toolchain. The Python stub script only runs the python on PATH if no py_runtime was specified via --python_top.

brandjon on 4 Jan 2019

👍1

@brandjon : Can you explain how executing a py_binary-generated target would work if the python interpreter is a toolchain? As I see it, running bazel-bin/foo/py/bin (which is currently a symlink to the stub python script, whose shebang line is #!/usr/bin/env python) requires an interpreter on the PATH.

laszlocsomor on 7 Jan 2019

And this is only the Linux/macOS stub script. On Windows we have another stub that is a C++ program (called the "launcher"). @meteorcloudy could correct me on this, but IIRC the local python interpreter's path is embedded in the launcher.

laszlocsomor on 7 Jan 2019

@laszlocsomor The native launcher also only uses python on PATH if no py_runtime was specified, basically it always uses the same python binary as the python stub template, see https://github.com/bazelbuild/bazel/blob/f451cd89f1b5a3c2d1ef266d5444d0c42f93e322/src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java#L200

That said, the python binary (including the native launcher for python) won't be a problem if python path is explicitly specified even with --incompatible_strict_action_env

meteorcloudy on 7 Jan 2019

I think we need to change the default PATH on macos and Windows. It might be worthwhile to roll back the flag flip, except it may be too late now?

ulfjack on 7 Jan 2019

@meteorcloudy : Thanks, that's comforting to know.
@ulfjack : I agree that it's probably too late to roll back the flag flip commit now. What do you think the default PATH should contain on macOS and Windows?

laszlocsomor on 7 Jan 2019

Regarding Windows:

Since the Bazel windows prereq page lists the minimum Windows version as Windows Server 2008 R2, and that version includes powershell 2.0, it's reasonable to add it on the path.
The prereq page doesn't list python though. It should only be added if it's a prereq.

filipesilva on 7 Jan 2019

👍1

I don't have a mac available right now, but I take it /sbin is on the PATH by default, so we should add it to the default path in Bazel as well.

It's
/Users/[user]/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:
according to https://www.reddit.com/r/MacOS/comments/5zabo3/question_what_is_the_default_path_for_macos/, although we should leave out the users home bin directory for hermeticity.

ulfjack on 7 Jan 2019

For a specific proposal, I'd say we should at least set it to /usr/bin:/bin:/usr/sbin:/sbin on macOS.

ulfjack on 7 Jan 2019

I assume the strict PATH env should contain the most common native tools available on that platform. For Windows, I think they should be:

C:\Windows and C:\Windows\System32, for common Windows cmd commands, eg. findstr.exe, whoami.exe, where.exe etc.
C:\Windows\System32\WindowsPowerShell\v1.0, for powershell.exe. @philwo pointed out different powershell versions are all installed in this same path, even if v1.0 is in the path.
C:\tools\msys64\usr\bin and C:\tools\msys64\bin, for bash bin tools. This will change if users specify a different bash executable by --shell_executable or shell toolchain.

But there are two issues to be considered on Windows:

System root could change. Although the most common system root is C:\Windows, users could have different system root. If that's not C:\Windows, the default PATH env is wrong. One possible solution is to replace C:\Windows with %SYSTEMROOT%. We always set %SYSTEMROOT% for every action on Windows and it's a special env var set before launching the process, which won't affect any action cache. See https://github.com/bazelbuild/bazel/blob/bd78bc54c8b66750cf97f704889432a9680bd3a0/src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocessFactory.java#L121
File paths are case insensitive on Windows. Currently, if you change PATH from C:\foo\bar to C:\Foo\bar, the action will be re-executed, I think we should fix this.

meteorcloudy on 7 Jan 2019

We need to achieve three things here:

Have a default that's generally usable for local builds
The default must work with remote execution (e.g., no local usernames, machine-specific directories, etc.); this is to avoid cache thrashing / cross-user cache misses by default
Be easy to override as needed (with the caveat that different settings across users can cause cache misses / thrashing)

I think the images that will be built / are already being built for remote execution will largely put binaries in the locations matching the default PATH.

(I don't think I'm saying anything new, just trying to clarify the situation. Using %SYSTEMROOT% in PATH should be fine; however, we do need to figure out how to handle %SYSTEMROOT% with remote execution to prevent remote cache thrashing.)

ulfjack on 8 Jan 2019

Re @meteorcloudy :
I agree those paths (powershell, system32, msys2 bin and usr/bin) should be on the strict PATH, and nothing more.
Challenge 1 seems to be workaround-able.
I don't see why challenge 2 is a problem in the strict env case (or did you mean this for non-strict env builds?) What am I missing?

Also, it's not clear to me anymore who's waiting on whose input and who are decision makers on this bug. How can we move forward?

I'm happy to volunteer to add an incompatible flag enabling the strict PATH that @meteorcloudy proposed.

laszlocsomor on 8 Jan 2019

@laszlocsomor Oh, you're right, the second point should not be a problem in strict env case.

That would be great if you can make a change to add those path for Windows. @ulfjack Can you take care of this problem on Mac?

meteorcloudy on 8 Jan 2019

@philwo and I talked offline about this issue.

I think it's important to keep in mind that the only reason why we flipped this flag is to improve the usability for remote execution/caching users. At the expense of local execution users who do not have tools installed outside of Bazel's default PATH.

A possible solution that we discussed is to declare PATH as special and to not include it in the remote action definition at all. Instead we argue that the remote execution system should be responsible for providing a default PATH that works for its environment - just as the local system does. The remote system ultimately knows best where all the programs that it wants to make available are installed.

Now, if a user explicitly sets --action_env=PATH=... then we should forward the value to the remote execution system.

I don't really see any downside with this approach. Can anyone talk me out of this?

buchgr on 8 Jan 2019

👍1

@buchgr wouldn't two users with two different local paths (e.g. windows installed in different disks) be unable to use each others remote cache?

I think this topic a bit confusing. My local environment is unlikely to be similar to the remote executors. Especially on windows where I have to make a decision of where to install python/msys. Nothing tells me I have to install those things in default paths in order to be in the happy path. I wouldn't expect that my local path for python would prevent me from using the remote cache that my coworker set.

filipesilva on 8 Jan 2019

@meteorcloudy : I'll wait with implementing this until @buchgr decides whether PATH will be part of the action signature at all. IIUC, if PATH is not part of the action then we can simply disable strict action env with local execution, and my change to the strict PATH would be unnecessary.

laszlocsomor on 9 Jan 2019

@filipesilva The most recent idea would solve that problem, because the PATH (and possibly LD_LIBRARY_PATH and HOME) would not factor into the remote action cache key at all. Instead, the execution strategies would be responsible for setting PATH, LD_LIBRARY_PATH and HOME to good values (similar to what @benjaminp suggested). For example, on a local system, they'd be taken from the client environment (unless overridden by --action_env). On a remote system, the backend knows best what good values for these variables are. For example, when using Docker, it's usual to just specify PATH, HOME etc. inside the Dockerfile, because they have to be tailored to the container. There's no way Bazel could know what the correct variables are for a given container (unless you repeat the definition in some kind of Bazel-specific platform configuration, which would violate DRY and not be great usability).

Thus, two users with different PATHs would be able to get cache hits if everything else is the same for the actions. Whether that's safe or not is not something Bazel can decide, but the user can (by carefully controlling their environment, or by making developer workstations read only from the cache and only let CI machines populate it).

philwo on 9 Jan 2019

which would violate DRY

did you mean ODR?

laszlocsomor on 9 Jan 2019

The danger of removing PATH from the action signature is that it can make the existing incorrectness in local/remote caching (not remote/docker execution) worse. Say I have a gcc in /usr/bin, and one in /usr/local/bin, and Bazel picks up the wrong one because of the PATH. Then I change the path, but Bazel keeps the cache hit from the old one.

The current behavior of including PATH in the cache key is simple, and we understand exactly how it fails. Not including it has consequences that we don't fully understand. My conservative advice is to take more time to think through the consequences and try to come up with a good long-term strategy. My personal preference is for Bazel to be more conservative than necessary, but provide mechanisms to make it more lenient, rather than the other way round.

What scenarios do we want to be cache hits, how do we track inputs, and how do we make sure the right environment variables (incl. PATH) are set? What mechanisms do we want to use to allow users to override Bazel's default behavior? What happens if the same binary is installed in different locations on different machines?

ulfjack on 9 Jan 2019

@philwo I understand that the problem I presented is addressed by @buchgr proposal when --action_env=PATH=... is not set.

But I don't understand how I can avoid setting it in custom local environments.

Perhaps I misunderstood the proposal, but I was under the impression that having any of the needed system tools (cmd, powershell, maybe more?) in a non-default location would force setting --action_env=PATH=....

The path to python is catered for with --python_path, and the path to msys has --shell_executable/BAZEL_SH. So changes to those two defaults don't imply setting action_env.

Perhaps the other tools whose paths are assumed need flags as well to avoid setting action_env.

filipesilva on 9 Jan 2019

👍1

I propose that such binaries like zip, unzip, git which can be used internally in Bazel by a user should have corresponding command line flag where they should come from.

excitoon on 9 Jan 2019

@excitoon : FYI, your proposal points in the direction of https://github.com/bazelbuild/bazel/issues/5265, i.e. expressing the shell (and common tools such as you mentioned) as a toolchain.

laszlocsomor on 9 Jan 2019

👍1

Perhaps I misunderstood the proposal, but I was under the impression that having any of the needed system tools (cmd, powershell, maybe more?) in a non-default location would force setting --action_env=PATH=....

No, the idea is that your local PATH (the one also used by your shell) already points to these tools and then Bazel just uses that PATH. Bazel would no longer use a "default" PATH that only includes some standard directories.

If you have tools that you want Bazel to use that are not in your usual $PATH, you can either modify your $PATH to include them before launching Bazel or use e.g. --action_env=PATH=$PATH:/opt/toolchain/bin.

Does that answer your question? Sorry, I might be misunderstanding something.

philwo on 9 Jan 2019

@philwo yes I think that answers my question. Thank you for explaining it.

filipesilva on 9 Jan 2019

This is confusing. I have an action like

# `example` is a command in a directory of the caller's $PATH variable
ctx.actions.run(
  executable = 'example', 
  use_default_shell_env = True,
  ...
)

that stopped working with bazel 0.21.0 unless --action_env=PATH gets passed to the build command. Is this the intended behaviour? What's the purpose of use_default_shell_env if it doesn't provide the caller's PATH variable straight away? Or was I using the API wrong all along? It worked up until now.

slyrz on 12 Jan 2019

@slyrz I think that's a regression. use_default_shell_env is intended to let you use the user's default environment (and not the built-in, strict environment that Bazel constructs). If that no longer works with strict_action_env, I think that's not intended.

Do you mind filing a bug so that we can fix this? You can basically just copy & paste the comment you wrote here into a new issue.

philwo on 12 Jan 2019

If I remember correctly, you get a completely empty environment without use_default_shell_env, and the strict one with it.

ulfjack on 14 Jan 2019

So, what do we do here - I think the current state is a regression for users and usability and it’s also unintuitive (users will not immediately know why their stuff is failing and will blame Bazel).

I would like to roll this back and let Bazel use the user’s PATH and LD_LIBRARY_PATH by default for local execution (conservatively also including them in the Action cache key, just as before), while we figure out how to still let remote caching work in this case. Remote caching users who depend on the current behavior could temporarily set the flag manually. Remote execution users wouldn’t be affected, because that uses its own PATH anyway.

What do you think?

philwo on 17 Jan 2019

👍1

@philwo : I agree with you.

laszlocsomor on 18 Jan 2019

I discussed this offline with @philwo. We'll rollback the flag flip in Bazel 0.22 and come up with a longer term solution after the 0.22 release.

buchgr on 18 Jan 2019

Just a general comment, from someone who started using bazel a bit over a year ago, I thought part of the point was for bazel actions to be as hermetic as possible and that means either the tool/compiler/interpreter etc used by bazel is either one that is hermetically provided or at the very least directly points to a path of the tool as specified. My understanding is that toolchains can fulfill both use-cases. So why should we ever need PATH apart from having it as a temporary measure? Or maybe for quick prototyping of a rule.

Globegitter on 21 Jan 2019

👍1

@Globegitter Yes, Bazel actions should be as hermetic as possible. Users can check in hermetic toolchains, rules can automatically fetch hermetic toolchains and use them or users can rely on the automatic configuration of toolchains via Starlark repositories that inspect the local environment and then construct a toolchain based on tools from the local system on the fly.

Whether PATH is set or not in actions shouldn't matter for these rules and actions, they are and will stay hermetic (as hermetic as you can get on a local system with the sandboxing we have), as they don't look things up in the PATH anyway.

If you setup your project, tests and rules to be fully hermetic and want to never rely on PATH, you can already today set --action_env=PATH=, which will run actions with an empty PATH.

However, I don't think that's suitable as the default behavior for Bazel. Not even Bazel itself can be built with that setting today and the build will fail almost immediately. genrule and sh_test come to mind as examples of things in a build that like to rely on the PATH to find binaries - which is fine, because hardcoding that e.g. find is always in /usr/bin/find is arguably worse (it could be in /usr/local/bin or in a completely different place when using nixOS). We don't yet have the mechanisms to use bash, zip, unzip, find, ... from a hermetic toolchain, although I heard that people are working on this. One idea would be that then, when we have that feature, sh_* and genrule actions could run with a PATH set to a directory constructed by Bazel that only has tools from the hermetic shell toolchain.

philwo on 21 Jan 2019

We don't yet have the mechanisms to use bash, zip, unzip, find, ... from a hermetic toolchain, although I heard that people are working on this.

I'm afraid they aren't -- https://github.com/bazelbuild/bazel/issues/5265 is unassigned.

laszlocsomor on 21 Jan 2019

Ah! Well, then it's even more important that we leave PATH alone, because users don't even have a better mechanism.

There are a couple of other environment variables that we should also whitelist:

LD_LIBRARY_PATH, because while it's a hack, people with custom toolchains and/or debugging software seem to rely on it and it breaks them if we don't pass it through.
On Windows, PROCESSOR_ARCHITECTURE and friends are important, because otherwise some software cannot determine on which architecture it is running (Python's machine.platform() will just return an empty string when these env vars are not set).

philwo on 21 Jan 2019

This is being further discussed on the mailing list here and here.

edbaunton on 22 Jan 2019

Which team handles this?

dslomov on 15 Feb 2019

@buchgr was interested in flipping this flag, so I'm adding "team-Remote-Exec".

laszlocsomor on 18 Feb 2019

This flag was flipped back in 0.22.0 so I'm dropping the priority.

laszlocsomor on 18 Feb 2019

I am currently working on summarising the discussion we had on the mailing list in a document. Will share once ready and then hopefully we'll be able to resolve this issue in the 0.24 release!

buchgr on 4 Mar 2019

I have drafted up a document with some ideas that's not ready to share yet, because I hadn't have time to polish it. I am currently focusing on delivering https://github.com/bazelbuild/bazel/issues/6862 and once this feature landed, this issue will be my next priority.

buchgr on 11 Mar 2019

👍1

Not sure what the resolution is on this issue, but to weigh in from a Python perspective:

The launcher needs (or rather, will need in a future redesign) to be able to locate the system's python.exe and/or py.exe binaries. This could be done by reading registry keys for the Python installation(s), accessing the PATH at action execution time, or accessing the PATH at workspace evaluation time and storing it for later use (for single-platform builds only). If PATH is wiped out for some actions then that affects what choice we'd want to go with.

brandjon on 21 May 2019

@brandjon

If PATH is wiped out for some actions then that affects what choice we'd want to go with.

I am working on a design atm. The gist is that I propose to add environment variables to toolchains. Will set you as reviewer once published.

buchgr on 22 May 2019

@brandjon to educate myself more, is PATH needed in some action, or is it enough to read PATH from a repository rule doing autodetection of the python toolchain and all python actions then can use this potentially absolute path to python.exe without the need to look into PATH at the execution phase?

hlopko on 22 May 2019

Well the repo rule method saves a little overhead (extra process launch for bash or a C++ wrapper), but it only works for single-platform builds.

Regardless of what stage it's done at, logically the detection needs to be done on the same platform as the target runs on, and the detection needs access to the system PATH. Alternatively, we could require users to provide their own explicit toolchains in more circumstances, instead of relying on the autodetecting one.

See #8414 and bazelbuild/continuous-integration#578 for how autodetection at execution time can break in subtle ways.

brandjon on 22 May 2019

(Ftr anecdata and more C++ context, extra process launch per a C++ compile action on Windows slowed Tensorflow build down by 30%. C++ autodetection involves multiple command invocations and has a potential to involve many more in the future, so performing autodetection as part of every action is a show stopper).

Are we in agreement? I acknowledge how autodetection at execution can break in subtle ways. I'm voting for doing python autodetection only in the repository rule. There I'd generate a python toolchain that points to the python binary using absolute path, completely decoupling itself from the PATH environment variable.

This has a limitation - this autodetection cannot be used without extra step with remote execution. We overcome this by statically preparing remote toolchains in https://github.com/bazelbuild/bazel-toolchains. We are also discussing other ways how to improve this use cases (roughly speaking executing repository rules on the remote worker).

Wdyt?

hlopko on 28 May 2019

Sounds good to me to use a python interpreter determined at workspace evaluation time.

Hypothetically, we could also do autodetection at execution time in addition to workspace evaluation time. The autodetecting toolchain would be given lower priority, so it would only be used when building for Windows but Windows is not the host platform. It also would be unused if an explicit toolchain for the platform is defined. This approach would maximize out-of-the-box ease-of-use with Windows + remote execution, but I'm certain it's overkill for now (and maybe always?).

brandjon on 28 May 2019

Is anyone working to roll this flag forward again?

In node, it's common to run Bazel underneath some other workflow tool, and these can set a different environment every time (yarn sets $PATH with a new temp directory for each run). Users will observe non-incremental builds until they discover the need for this flag.