Bazel: py_binary rules default to importing non-sandboxed code

Created on 11 Jan 2019  路  7Comments  路  Source: bazelbuild/bazel

Description of the problem / feature request:

Python (2.7.15 in this case) populates the first entry of the sys.path by resolving the first argument it is given to its real path (including resolving symlinks) and then using the directory containing that real path. This interferes with Bazel's sandboxing of Python files via symlinks and the runfiles in general, since the actual directory from the original source code will be the first entry in the sys.path.

More info on the method of populating sys.path[0] can be found in these discussions:
https://bugs.python.org/issue6386
https://bugs.python.org/issue17639

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Init a fresh git repo, git apply this patch, and run bazel run py:test

diff --git a/WORKSPACE b/WORKSPACE
new file mode 100644
index 0000000..e69de29
diff --git a/py/BUILD b/py/BUILD
new file mode 100644
index 0000000..ec84ab9
--- /dev/null
+++ b/py/BUILD
@@ -0,0 +1,6 @@
+py_binary(
+        name = 'test',
+        srcs = glob(['**/*.py'], exclude=['foo/baz.py']),
+        imports = [''],
+        main = 'main.py',
+)
diff --git a/py/foo/__init__.py b/py/foo/__init__.py
new file mode 100644
index 0000000..e69de29
diff --git a/py/foo/bar.py b/py/foo/bar.py
new file mode 100644
index 0000000..e69de29
diff --git a/py/foo/baz.py b/py/foo/baz.py
new file mode 100644
index 0000000..e69de29
diff --git a/py/main.py b/py/main.py
new file mode 100644
index 0000000..5c21a2d
--- /dev/null
+++ b/py/main.py
@@ -0,0 +1,10 @@
+import sys
+# sys.path = sys.path[1:]  # Uncommenting this line gives us the expected behavior
+print sys.path
+
+import foo.bar
+print foo.bar.__file__
+
+# This shouldn't work!
+import foo.baz
+print foo.baz.__file__

Note that when sys.path is printed out, the first entry is in the original source code, and not the bazel output directory. Also note that the py/foo/baz.py file isn't included in the target but can be imported. Manually trimming sys.path at the beginning of the main.py file yields the expected behavior.

On solution to this is to not symlink, but rather actually copy, the main file. There are likely other solutions (maybe something involving a .pth file?) but I wasn't able to determine if it was possible to turn off this symlink-following behavior in Python (it seems like not, given the conversations I linked to earlier).

What operating system are you running Bazel on?

macOS 10.14.2

What's the output of bazel info release?

release 0.20.0-homebrew
(though I also repro'd this with non-Brew 0.21.0 and Brew 0.17.2)

Have you found anything relevant by searching the web?

Python issues listed above, regarding sys.path[0]:
https://bugs.python.org/issue6386
https://bugs.python.org/issue17639
Bazel issue regarding symlinking to a runfiles:
https://github.com/bazelbuild/bazel/issues/4022
Commit handling runfiles symlinks on Windows:
https://github.com/bazelbuild/bazel/pull/6036

P3 team-Rules-Python bug

Most helpful comment

Is there any plan to merge proposed fixes for that issue? This is really bad as it completely breaks sandboxing for python binaries and libs.

All 7 comments

Thanks for the summary. Another consequence of this issue is #793, where code that relies on importing unqualified names from the same directory as the script will fail depending on whether the script is generated.

My guess is that the easiest solution is for us to hack around this in the stub file somehow.

It looks like sys.argv[0] initialization happens after site.py processing is done. This means that injecting a custom sitecustomize.py into the runfiles (#7959) could not by itself prevent the source dir of the entry point from being added to sys.path.

Looking at the CPython source, apparently sys.argv[0] is not set if the interpreter is launched with a module name (-m) or command (-c) rather than a script. So if we launched the entry point with -c it might work around this, but at that point we may as well just invoke a wrapper module that completely hacks up sys.path to our liking.

It seems we are running into a similar issue when trying to auto-generate py_binary targets for console_script entry points of pip packages (see https://github.com/apt-itude/rules_pip/pull/13). It seems to be quite common practice to have the file of the binary target to be the same as the package name. E.g. luigi has a bin/luigi.py, or pytest a bin/pytest.py. When trying to execute these targets you get an error such as module luigi not found or similar. See a minimal repro case: https://github.com/Globegitter/py_repro

Right now we have to patch the packages to rename these files, e.g. to bin/luigi-bin.py, or instead of depending directly on the binary in the external repository, we can create a py_binary target in our own workspace basically recreating what the binary would do.

It would be nice not having to do this workaround and just work out of the box with these pip packages.

Is there any plan to merge proposed fixes for that issue? This is really bad as it completely breaks sandboxing for python binaries and libs.

Just wanted to add, for anyone who comes across this like me, that this bug also affects py_test. We have code that we've always compiled into .pars, which works fine, but we're now adding tests and noticed this issue when running bazel test. I was able to work around it for now by explicitly popping the first element of sys.path before importing my modules, like the commits above that referenced this issue, but that's really ugly.

Looks like this was fixed by #9250, which I also confirmed locally on Bazel 1.2.1! Thank you for that @alex-xnor and @brandjon! 馃帀

Oh no, looks like I spoke too soon! I was bamboozled since I came back to this after not having looked at it for a while. The test I was running was importing code from a different module and hence not seeing these issues.

The wrapper script that was updated does indeed prune sys.path[0], however we then do a subprocess.call/os.execv (depending on the OS) which re-invokes Python which in turn re-does the logic to populate the path and adds our directory to sys.path[0], so that does still remain on the path and by-default break us out of the sandbox 馃檨

Was this page helpful?
0 / 5 - 0 ratings