Bazel 0.21.0 flipped --incompatible_strict_action_env to true via https://github.com/bazelbuild/bazel/issues/6648, which broke CI on macOS and Windows CI. On macOS we can no longer find "md5" (which is in /sbin) and on Windows we can no longer find Python, PowerShell and other tools.
We didn't discover these breakages during testing, because apparently we don't actually test Bazel itself in the "Bazel@HEAD + Downstream Projects".
We'll revert this via a flag in the CI config files, but flag owners should work on fixing our tests (and taking user feedback into account whether this new behavior is actually working well for others) so that we can eventually remove the config entry again.
Relevant logs:
Philipp, thanks for finding, filing, and working around this bug.
We (@philwo , @hlopko , and myself) argued about this flag today and realized that, while sh_* rules should probably not run python
directly (why would a shell script need to do that?), they should be able to run their py_binary
data-dependencies, which rightfully expect to have python
/ python.exe
on the PATH.
It's unclear to me what should Bazel do. It seems that PATH's whole purpose is to contain the locations of these common tools, in which case PATH should be exempt from the "strict" env and be available to the action. On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.
Another important observation @philwo made is that the "strict" action env is just as non-hermetic as the non-strict one; we just replace the clientenv's PATH with a hardcoded PATH string, but whatever is in those directories is out of our control (and not tracked by Bazel). And on one machine they might have python
(and point to python2) but on another they might have no python
at all or have python
pointing to python3.
One option would be to make PATH
work more like TMPDIR
, where it's not part of the action key and injected by the spawn runner at the last moment. That would be a step towards fixing #6392, too.
To prevent this from happening again, I filed https://github.com/bazelbuild/continuous-integration/pull/434
they should be able to run their py_binary data-dependencies, which rightfully expect to have python / python.exe on the PATH. [...] On the other hand, PATH is inherently machine-specific and as such it poisons remote caches.
I can't speak for PATH in general, but from the point of view of the Python interpreter: The long-term solution is to define the interpreter as a toolchain rule, where a particular toolchain target is appropriate for a specific system / path installation. The toolchain target for the current host system could be generated by a repository rule.
Currently py_runtime
is the closest thing we have to a toolchain. The Python stub script only runs the python on PATH if no py_runtime
was specified via --python_top
.
@brandjon : Can you explain how executing a py_binary
-generated target would work if the python interpreter is a toolchain? As I see it, running bazel-bin/foo/py/bin
(which is currently a symlink to the stub python script, whose shebang line is #!/usr/bin/env python
) requires an interpreter on the PATH.
And this is only the Linux/macOS stub script. On Windows we have another stub that is a C++ program (called the "launcher"). @meteorcloudy could correct me on this, but IIRC the local python interpreter's path is embedded in the launcher.
@laszlocsomor The native launcher also only uses python on PATH if no py_runtime
was specified, basically it always uses the same python binary as the python stub template, see https://github.com/bazelbuild/bazel/blob/f451cd89f1b5a3c2d1ef266d5444d0c42f93e322/src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java#L200
That said, the python binary (including the native launcher for python) won't be a problem if python path is explicitly specified even with --incompatible_strict_action_env
I think we need to change the default PATH on macos and Windows. It might be worthwhile to roll back the flag flip, except it may be too late now?
@meteorcloudy : Thanks, that's comforting to know.
@ulfjack : I agree that it's probably too late to roll back the flag flip commit now. What do you think the default PATH should contain on macOS and Windows?
Regarding Windows:
Windows Server 2008 R2
, and that version includes powershell 2.0, it's reasonable to add it on the path.I don't have a mac available right now, but I take it /sbin is on the PATH by default, so we should add it to the default path in Bazel as well.
It's
/Users/[user]/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:
according to https://www.reddit.com/r/MacOS/comments/5zabo3/question_what_is_the_default_path_for_macos/, although we should leave out the users home bin directory for hermeticity.
For a specific proposal, I'd say we should at least set it to /usr/bin:/bin:/usr/sbin:/sbin
on macOS.
I assume the strict PATH env should contain the most common native tools available on that platform. For Windows, I think they should be:
C:\Windows
and C:\Windows\System32
, for common Windows cmd commands, eg. findstr.exe
, whoami.exe
, where.exe
etc.C:\Windows\System32\WindowsPowerShell\v1.0
, for powershell.exe
. @philwo pointed out different powershell versions are all installed in this same path, even if v1.0
is in the path.C:\tools\msys64\usr\bin
and C:\tools\msys64\bin
, for bash bin tools. This will change if users specify a different bash executable by --shell_executable
or shell toolchain.But there are two issues to be considered on Windows:
C:\Windows
, users could have different system root. If that's not C:\Windows
, the default PATH env is wrong. One possible solution is to replace C:\Windows
with %SYSTEMROOT%
. We always set %SYSTEMROOT%
for every action on Windows and it's a special env var set before launching the process, which won't affect any action cache. See https://github.com/bazelbuild/bazel/blob/bd78bc54c8b66750cf97f704889432a9680bd3a0/src/main/java/com/google/devtools/build/lib/windows/WindowsSubprocessFactory.java#L121PATH
from C:\foo\bar
to C:\Foo\bar
, the action will be re-executed, I think we should fix this.We need to achieve three things here:
I think the images that will be built / are already being built for remote execution will largely put binaries in the locations matching the default PATH.
(I don't think I'm saying anything new, just trying to clarify the situation. Using %SYSTEMROOT% in PATH should be fine; however, we do need to figure out how to handle %SYSTEMROOT% with remote execution to prevent remote cache thrashing.)
Re @meteorcloudy :
I agree those paths (powershell, system32, msys2 bin and usr/bin) should be on the strict PATH, and nothing more.
Challenge 1 seems to be workaround-able.
I don't see why challenge 2 is a problem in the strict env case (or did you mean this for non-strict env builds?) What am I missing?
Also, it's not clear to me anymore who's waiting on whose input and who are decision makers on this bug. How can we move forward?
I'm happy to volunteer to add an incompatible flag enabling the strict PATH that @meteorcloudy proposed.
@laszlocsomor Oh, you're right, the second point should not be a problem in strict env case.
That would be great if you can make a change to add those path for Windows. @ulfjack Can you take care of this problem on Mac?
@philwo and I talked offline about this issue.
I think it's important to keep in mind that the only reason why we flipped this flag is to improve the usability for remote execution/caching users. At the expense of local execution users who do not have tools installed outside of Bazel's default PATH.
A possible solution that we discussed is to declare PATH as special and to not include it in the remote action definition at all. Instead we argue that the remote execution system should be responsible for providing a default PATH that works for its environment - just as the local system does. The remote system ultimately knows best where all the programs that it wants to make available are installed.
Now, if a user explicitly sets --action_env=PATH=...
then we should forward the value to the remote execution system.
I don't really see any downside with this approach. Can anyone talk me out of this?
@buchgr wouldn't two users with two different local paths (e.g. windows installed in different disks) be unable to use each others remote cache?
I think this topic a bit confusing. My local environment is unlikely to be similar to the remote executors. Especially on windows where I have to make a decision of where to install python/msys. Nothing tells me I have to install those things in default paths in order to be in the happy path. I wouldn't expect that my local path for python would prevent me from using the remote cache that my coworker set.
@meteorcloudy : I'll wait with implementing this until @buchgr decides whether PATH will be part of the action signature at all. IIUC, if PATH is not part of the action then we can simply disable strict action env with local execution, and my change to the strict PATH would be unnecessary.
@filipesilva The most recent idea would solve that problem, because the PATH (and possibly LD_LIBRARY_PATH and HOME) would not factor into the remote action cache key at all. Instead, the execution strategies would be responsible for setting PATH, LD_LIBRARY_PATH and HOME to good values (similar to what @benjaminp suggested). For example, on a local system, they'd be taken from the client environment (unless overridden by --action_env). On a remote system, the backend knows best what good values for these variables are. For example, when using Docker, it's usual to just specify PATH, HOME etc. inside the Dockerfile, because they have to be tailored to the container. There's no way Bazel could know what the correct variables are for a given container (unless you repeat the definition in some kind of Bazel-specific platform configuration, which would violate DRY and not be great usability).
Thus, two users with different PATHs would be able to get cache hits if everything else is the same for the actions. Whether that's safe or not is not something Bazel can decide, but the user can (by carefully controlling their environment, or by making developer workstations read only from the cache and only let CI machines populate it).
which would violate DRY
did you mean ODR?
The danger of removing PATH from the action signature is that it can make the existing incorrectness in local/remote caching (not remote/docker execution) worse. Say I have a gcc in /usr/bin, and one in /usr/local/bin, and Bazel picks up the wrong one because of the PATH. Then I change the path, but Bazel keeps the cache hit from the old one.
The current behavior of including PATH in the cache key is simple, and we understand exactly how it fails. Not including it has consequences that we don't fully understand. My conservative advice is to take more time to think through the consequences and try to come up with a good long-term strategy. My personal preference is for Bazel to be more conservative than necessary, but provide mechanisms to make it more lenient, rather than the other way round.
What scenarios do we want to be cache hits, how do we track inputs, and how do we make sure the right environment variables (incl. PATH) are set? What mechanisms do we want to use to allow users to override Bazel's default behavior? What happens if the same binary is installed in different locations on different machines?
@philwo I understand that the problem I presented is addressed by @buchgr proposal when --action_env=PATH=...
is not set.
But I don't understand how I can avoid setting it in custom local environments.
Perhaps I misunderstood the proposal, but I was under the impression that having any of the needed system tools (cmd, powershell, maybe more?) in a non-default location would force setting --action_env=PATH=...
.
The path to python is catered for with --python_path
, and the path to msys
has --shell_executable
/BAZEL_SH
. So changes to those two defaults don't imply setting action_env
.
Perhaps the other tools whose paths are assumed need flags as well to avoid setting action_env
.
I propose that such binaries like zip
, unzip
, git
which can be used internally in Bazel by a user should have corresponding command line flag where they should come from.
@excitoon : FYI, your proposal points in the direction of https://github.com/bazelbuild/bazel/issues/5265, i.e. expressing the shell (and common tools such as you mentioned) as a toolchain.
Perhaps I misunderstood the proposal, but I was under the impression that having any of the needed system tools (cmd, powershell, maybe more?) in a non-default location would force setting --action_env=PATH=....
No, the idea is that your local PATH (the one also used by your shell) already points to these tools and then Bazel just uses that PATH. Bazel would no longer use a "default" PATH that only includes some standard directories.
If you have tools that you want Bazel to use that are not in your usual $PATH, you can either modify your $PATH to include them before launching Bazel or use e.g. --action_env=PATH=$PATH:/opt/toolchain/bin
.
Does that answer your question? Sorry, I might be misunderstanding something.
@philwo yes I think that answers my question. Thank you for explaining it.
This is confusing. I have an action like
# `example` is a command in a directory of the caller's $PATH variable
ctx.actions.run(
executable = 'example',
use_default_shell_env = True,
...
)
that stopped working with bazel 0.21.0
unless --action_env=PATH
gets passed to the build command. Is this the intended behaviour? What's the purpose of use_default_shell_env
if it doesn't provide the caller's PATH variable straight away? Or was I using the API wrong all along? It worked up until now.
@slyrz I think that's a regression. use_default_shell_env
is intended to let you use the user's default environment (and not the built-in, strict environment that Bazel constructs). If that no longer works with strict_action_env, I think that's not intended.
Do you mind filing a bug so that we can fix this? You can basically just copy & paste the comment you wrote here into a new issue.
If I remember correctly, you get a completely empty environment without use_default_shell_env
, and the strict one with it.
So, what do we do here - I think the current state is a regression for users and usability and it鈥檚 also unintuitive (users will not immediately know why their stuff is failing and will blame Bazel).
I would like to roll this back and let Bazel use the user鈥檚 PATH and LD_LIBRARY_PATH by default for local execution (conservatively also including them in the Action cache key, just as before), while we figure out how to still let remote caching work in this case. Remote caching users who depend on the current behavior could temporarily set the flag manually. Remote execution users wouldn鈥檛 be affected, because that uses its own PATH anyway.
What do you think?
@philwo : I agree with you.
I discussed this offline with @philwo. We'll rollback the flag flip in Bazel 0.22 and come up with a longer term solution after the 0.22 release.
Just a general comment, from someone who started using bazel a bit over a year ago, I thought part of the point was for bazel actions to be as hermetic as possible and that means either the tool/compiler/interpreter etc used by bazel is either one that is hermetically provided or at the very least directly points to a path of the tool as specified. My understanding is that toolchains can fulfill both use-cases. So why should we ever need PATH
apart from having it as a temporary measure? Or maybe for quick prototyping of a rule.
@Globegitter Yes, Bazel actions should be as hermetic as possible. Users can check in hermetic toolchains, rules can automatically fetch hermetic toolchains and use them or users can rely on the automatic configuration of toolchains via Starlark repositories that inspect the local environment and then construct a toolchain based on tools from the local system on the fly.
Whether PATH is set or not in actions shouldn't matter for these rules and actions, they are and will stay hermetic (as hermetic as you can get on a local system with the sandboxing we have), as they don't look things up in the PATH anyway.
If you setup your project, tests and rules to be fully hermetic and want to never rely on PATH, you can already today set --action_env=PATH=
, which will run actions with an empty PATH.
However, I don't think that's suitable as the default behavior for Bazel. Not even Bazel itself can be built with that setting today and the build will fail almost immediately. genrule
and sh_test
come to mind as examples of things in a build that like to rely on the PATH to find binaries - which is fine, because hardcoding that e.g. find
is always in /usr/bin/find
is arguably worse (it could be in /usr/local/bin
or in a completely different place when using nixOS). We don't yet have the mechanisms to use bash, zip, unzip, find, ... from a hermetic toolchain, although I heard that people are working on this. One idea would be that then, when we have that feature, sh_* and genrule actions could run with a PATH set to a directory constructed by Bazel that only has tools from the hermetic shell toolchain.
We don't yet have the mechanisms to use bash, zip, unzip, find, ... from a hermetic toolchain, although I heard that people are working on this.
I'm afraid they aren't -- https://github.com/bazelbuild/bazel/issues/5265 is unassigned.
Ah! Well, then it's even more important that we leave PATH alone, because users don't even have a better mechanism.
There are a couple of other environment variables that we should also whitelist:
machine.platform()
will just return an empty string when these env vars are not set).Which team handles this?
@buchgr was interested in flipping this flag, so I'm adding "team-Remote-Exec".
This flag was flipped back in 0.22.0 so I'm dropping the priority.
I am currently working on summarising the discussion we had on the mailing list in a document. Will share once ready and then hopefully we'll be able to resolve this issue in the 0.24 release!
I have drafted up a document with some ideas that's not ready to share yet, because I hadn't have time to polish it. I am currently focusing on delivering https://github.com/bazelbuild/bazel/issues/6862 and once this feature landed, this issue will be my next priority.
Not sure what the resolution is on this issue, but to weigh in from a Python perspective:
The launcher needs (or rather, will need in a future redesign) to be able to locate the system's python.exe
and/or py.exe
binaries. This could be done by reading registry keys for the Python installation(s), accessing the PATH
at action execution time, or accessing the PATH
at workspace evaluation time and storing it for later use (for single-platform builds only). If PATH
is wiped out for some actions then that affects what choice we'd want to go with.
@brandjon
If PATH is wiped out for some actions then that affects what choice we'd want to go with.
I am working on a design atm. The gist is that I propose to add environment variables to toolchains. Will set you as reviewer once published.
@brandjon to educate myself more, is PATH needed in some action, or is it enough to read PATH from a repository rule doing autodetection of the python toolchain and all python actions then can use this potentially absolute path to python.exe without the need to look into PATH at the execution phase?
Well the repo rule method saves a little overhead (extra process launch for bash or a C++ wrapper), but it only works for single-platform builds.
Regardless of what stage it's done at, logically the detection needs to be done on the same platform as the target runs on, and the detection needs access to the system PATH. Alternatively, we could require users to provide their own explicit toolchains in more circumstances, instead of relying on the autodetecting one.
See #8414 and bazelbuild/continuous-integration#578 for how autodetection at execution time can break in subtle ways.
(Ftr anecdata and more C++ context, extra process launch per a C++ compile action on Windows slowed Tensorflow build down by 30%. C++ autodetection involves multiple command invocations and has a potential to involve many more in the future, so performing autodetection as part of every action is a show stopper).
Are we in agreement? I acknowledge how autodetection at execution can break in subtle ways. I'm voting for doing python autodetection only in the repository rule. There I'd generate a python toolchain that points to the python
binary using absolute path, completely decoupling itself from the PATH
environment variable.
This has a limitation - this autodetection cannot be used without extra step with remote execution. We overcome this by statically preparing remote toolchains in https://github.com/bazelbuild/bazel-toolchains. We are also discussing other ways how to improve this use cases (roughly speaking executing repository rules on the remote worker).
Wdyt?
Sounds good to me to use a python interpreter determined at workspace evaluation time.
Hypothetically, we could also do autodetection at execution time in addition to workspace evaluation time. The autodetecting toolchain would be given lower priority, so it would only be used when building for Windows but Windows is not the host platform. It also would be unused if an explicit toolchain for the platform is defined. This approach would maximize out-of-the-box ease-of-use with Windows + remote execution, but I'm certain it's overkill for now (and maybe always?).
Is anyone working to roll this flag forward again?
In node, it's common to run Bazel underneath some other workflow tool, and these can set a different environment every time (yarn sets $PATH
with a new temp directory for each run). Users will observe non-incremental builds until they discover the need for this flag.
If anyone is, it's @buchgr or @ishikhman .
(@buchgr is out for this week and the next though.)
there were some discussions around this. @buchgr will be able to give a better update when he is back in a week.
Any updates here? Can this be flipped again?
Unfortunately I鈥檓 not aware of any further work that has happened in this area. I think the situation is still unchanged from a year ago.
is there an RFC for how this might be addressed?