Pants: Switch pants distribution model from "python embeds rust" to "rust embeds python"

Created on 12 Mar 2019 · 25 Comments · Source: pantsbuild/pants

tl;dr: See the title.

Concretely: rather than shipping wheels (or even pexes, as in #4896), we would ship static rust executables, the same way that we currently ship pexes for releases. To do that, we would move away from:

Locate and use a python interpreter on the user's system to run python code that loads a rust engine

and toward

A self-contained rust binary with an embedded python interpreter loads some python code "from within the binary"


Implementation-wise, this might look like Mercurial's approach (although there is probably more modern prior art): https://www.mercurial-scm.org/wiki/OxidationPlan

An advantage to this would be removing our dependence on a python interpreter on the user's machine, and improving our path forward for low-latency startup.

Q42020-idea releases

All 25 comments

I think this is a really good idea, and follows somewhat the model where we provide a compiler and linker toolchain that works on any supported platform, instead of trying to write code which can conform to every possible user's compiler and linker toolchain. This also means we can get extremely funky with the python interpreter we ship (pypy!!!!) as long as we don't break existing pants plugins.

More relevant background: https://code.fb.com/data-infrastructure/xars-a-more-efficient-open-source-system-for-self-contained-executables/ (at https://github.com/facebookincubator/xar/ on github). The squashfs part seems like it might not be relevant, but it is one model for interfacing with heterogeneous resources across multiple source languages.

https://github.com/indygreg/PyOxidizer is also a good bit of prior art that I'd forgotten to mention before.

This might be a fun hackweek project!

I can't tell whether it is impressive or terrifying that PyOxidizer seems to have completely wrapped away the build process here: https://pyoxidizer.readthedocs.io/en/latest/getting_started.html

Depending how well it works, it has the potential to drop a lot of boilerplate... but it might also really obfuscate things.

A less frameworky alternative to PyOxidizer appears to be the cpython crate: https://www.reddit.com/r/rust/comments/fxe99l/cffi_vs_cpython_vs_pyo3_what_should_i_use/

Maintainer of PyOxidizer here. I stumbled across this issue and thought I'd leave a comment.

I just wanted to drop links to the relatively new PyOxidizer documentation about using PyOxidizer as a mechanism for maintaining Rust projects that embed Python. https://pyoxidizer.readthedocs.io/en/latest/rust.html, https://pyoxidizer.readthedocs.io/en/latest/rust_rust_code.html, and https://pyoxidizer.readthedocs.io/en/latest/rust_porting.html are the most relevant links.

While it is certainly possible to use the cpython or pyo3 crates for managing an embedded Python interpreter yourself, it can get pretty tedious pretty quickly. If full-blown PyOxidizer is too heavyweight for you, you can use just the run-time crates that are part of the PyOxidizer project, pyembed and oxidized_importer, for embedded Python interpreter control and custom Python importing, respectively. This is arguably the better approach for Rust developers (as opposed to Python application maintainers, who are PyOxidizer's primary audience).

Please make noise on PyOxidizer's GitHub project if you want more features from the lower-level Rust crates. Finally, there are many improvements to these crates in the yet-to-be-released 0.8 release. So be sure to evaluate the main branch instead of the 0.7 release!

@indygreg : Thanks a lot for the links and offer! That's very helpful. Since creating this, we have in fact ported to cpython (in #9593), and so we are one step closer to taking a look at PyOxidizer.

The documentation is also really thorough: thank you.

@indygreg : Due to at least half serendipitous need (and some other fraction due to your helpful comment the other day!) I'm taking a look at this today. Seem to be making progress so far: thanks again!

@indygreg : I ran out of steam for the day, but I'm currently figuring out how to actually include loose sources in the PythonExecutable. The top commit here attempts to add sources: https://github.com/pantsbuild/pants/commits/3b132b42ac8a349bec9eeb86a237432ad3e2733e ... the commit below it sets run_module='pants.bin.pants_loader:main', which builds fine and (as expected without any sources included) fails to load.

But adding sources results in:

error[PYOXIDIZER_BUILD]: in-memory-only resources policy active but in-memory extension module importing not supported by this configuration
   --> /Users/stuhood/src/pants/src/rust/engine/pyoxidizer.bzl:113:5
    |
113 | /     exe.add_in_memory_python_resources(dist.read_package_root(
114 | |         path=CWD + "/../../../src/python",
115 | |         packages=["pants", "pants.bin", "pants.bin.pants_loader"],
116 | |     ))
    | |______^ add_extension_module


error: in-memory-only resources policy active but in-memory extension module importing not supported by this configuration

EDIT: Backtracking a bit, this looks related to the PythonDistribution returned by default_python_distribution(). On this OSX, standalone_static and standalone work (but encounter the above error about the resources policy), and standalone_dynamic fails saying:

error: could not find default Python distribution for x86_64-apple-darwin

EDIT2: Ok, trying with a much simpler module with no dependencies seems to successfully build and run! Huzzah. So it seems like read_package_root is potentially returning non-PythonSourceModule items when used on larger portions of the repository.

EDIT3: The above "in-memory extension module importing not supported by this configuration" error appears to have been caused by a .so file left behind by our previous build mechanism, which loaded the rust code as a cdylib. Removing that file allows more progress to be made!
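For anyone hitting the same problem, here is a minimal sketch of a pre-packaging check that lists compiled extension files left behind in a source tree, so they can be removed before read_package_root scans it. The helper name find_stray_extension_modules and the suffix list are assumptions made for illustration; nothing here is part of PyOxidizer or pants.

```python
from pathlib import Path

# Suffixes that typically mark compiled extension modules or shared libraries.
# This list is an assumption for illustration; adjust it for your platform.
EXTENSION_SUFFIXES = (".so", ".pyd", ".dylib")


def find_stray_extension_modules(package_root):
    """Return the sorted paths of compiled extension files under package_root."""
    root = Path(package_root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix in EXTENSION_SUFFIXES
    )
```

Running this against the directory passed as path= to read_package_root and deleting (or excluding) whatever it reports should leave only pure-Python sources for in-memory packaging.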

Try switching the add_in_memory_python_resources() to just add_python_resources(): that will tag the resource for in-memory or filesystem loading depending on the capabilities of the Python interpreter.

Also, I'm not at all surprised that read_package_root() is picking up extra files. I found some issues in the resource scanning code a few weeks ago. Things are much better on the main branch.

I also recommend using the pip_install() method if possible: it will install everything to a temporary directory then scan for resources there. This should eliminate the orphaned .so problem you ran into.

Also, I just overhauled the documentation on available methods for collecting Python resources. The new docs live at https://pyoxidizer.readthedocs.io/en/latest/packaging_python_files.html. The docs now include some warnings regarding caveats which you ran into!

re:

> Try switching the add_in_memory_python_resources() to just add_python_resources(): that will tag the resource for in-memory or filesystem loading depending on the capabilities of the Python interpreter.
>
> I also recommend using the pip_install() method if possible: it will install everything to a temporary directory then scan for resources there. This should eliminate the orphaned .so problem you ran into.

Ok, thanks. After removing the stray .so file, loading more complicated source-only modules seems to work via read_package_root + add_in_memory_python_resources.

But the next issue is that I now need to actually import some 3rdparty/binary dependencies, and so am trying out the pip_install method. It also seems to be encountering some PythonExtensionModule instances and getting the "in-memory-only resources policy active but in-memory extension module importing not supported by this configuration" error... but neither add_in_memory_python_resources nor add_python_resources works in this case.

The requirements file is: https://github.com/pantsbuild/pants/blob/master/3rdparty/python/requirements.txt ... in this case, is the idea that some of these 3rdparty distributions are _only_ available as binary? It would be handy if this error included which PythonExtensionModule was incompatible: https://github.com/indygreg/PyOxidizer/blob/e2d5b152013bd5a925380b9d4a87d0b153025706/pyoxidizer/src/py_packaging/standalone_distribution.rs#L1758 (... I'm building up a list that I'll follow up to try and fix, or at the very least report!)

> Also, I just overhauled the documentation on available methods for collecting Python resources. The new docs live at https://pyoxidizer.readthedocs.io/en/latest/packaging_python_files.html. The docs now include some warnings regarding caveats which you ran into!

Great, thanks: will read those now!

> The requirements file is: https://github.com/pantsbuild/pants/blob/master/3rdparty/python/requirements.txt ... in this case, is the idea that some of these 3rdparty distributions are only available as binary? It would be handy if this error included which PythonExtensionModule was incompatible: https://github.com/indygreg/PyOxidizer/blob/e2d5b152013bd5a925380b9d4a87d0b153025706/pyoxidizer/src/py_packaging/standalone_distribution.rs#L1758 (... I'm building up a list that I'll follow up to try and fix, or at the very least report!)

https://github.com/indygreg/PyOxidizer/pull/272 exposes that (at least?) mypy and python-Levenshtein were problematic here.

I'll try a non-in-memory resources policy.

@indygreg : Made some more progress today: thanks again for your involvement!

The current status in https://github.com/pantsbuild/pants/commits/0e742635489cb846c6edee22a174a7fdba09eeb5 is that all of our source code appears to be successfully included! I can open a repl and import various simple modules. Some more complicated modules fail to load: see below.

The next two issues I'm looking at are:

  1. Resources that we try to load using a combination of __name__ and pkgutil.get_data are not yet available. In particular, this line returns a None resource that fails to decode: https://github.com/pantsbuild/pants/blob/2f266c34e32a6d9a29d40ac92798b7e04cb8dd64/src/python/pants/version.py#L16-L17.
  2. To handle the native extensions that we use (mypy, python-Levenshtein, setproctitle, etc), I needed to use prefer-in-memory-fallback-filesystem-relative:$ARG, but I don't understand exactly what to include as the argument there yet, and so they fail to load at runtime.

It seems like both of the above are likely addressed in https://pyoxidizer.readthedocs.io/en/latest/packaging_pitfalls.html , so I'll be studying that next.
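To make the first issue concrete: pkgutil.get_data returns None (rather than raising) when the package cannot be located or its loader does not implement get_data, so calling .decode() on the result fails. A minimal defensive sketch follows; the helper name read_text_resource and its default parameter are hypothetical, and this is not the code pants actually uses.

```python
import pkgutil


def read_text_resource(package, resource, default=None):
    """Read a text resource, tolerating loaders that cannot supply data."""
    data = pkgutil.get_data(package, resource)
    if data is None:
        # pkgutil.get_data returns None when the package cannot be located
        # or when its loader does not implement get_data.
        if default is None:
            raise FileNotFoundError(
                f"resource {resource!r} is not available from package {package!r}"
            )
        return default
    return data.decode("utf-8").strip()
```

This only turns the opaque failure on None into an explicit error or fallback; the resource still has to be packaged so that the embedded importer can serve it.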

The comments reflecting your mental stream as you debug this are very helpful! I'm already realizing there are some gaps in PyOxidizer's documentation! Please keep it up!

> I needed to use prefer-in-memory-fallback-filesystem-relative:$ARG, but I don't understand exactly what to include as the argument there yet, and so they fail to load at runtime.

Aha... https://pyoxidizer.readthedocs.io/en/latest/packaging_resources.html#python-resource-locations appears to have been the missing link here (literally? could maybe be linked-from/embedded-in https://pyoxidizer.readthedocs.io/en/latest/config_api.html#config-python-resources-policy ?)

Reading through this, it occurs to me that a better factoring of https://github.com/indygreg/PyOxidizer/pull/272 might be to have all of the "bulk adding of resources" APIs identify all incompatible arguments and summarize them in an error message.

Currently looking into how to align the prefer-in-memory-fallback-filesystem-relative argument with the on disk contents via https://pyoxidizer.readthedocs.io/en/latest/packaging_resources.html#routing-python-resources-to-locations
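As a rough mental model of what the argument means, here is a sketch that parses a location string such as "filesystem-relative:lib" and computes where the files would be expected on disk. It assumes the prefix is resolved relative to the directory containing the built executable, which is my reading of the linked docs rather than something verified against PyOxidizer's source; parse_location and resource_directory are hypothetical names.

```python
from pathlib import Path


def parse_location(location):
    """Split 'in-memory' or 'filesystem-relative:<prefix>' into (kind, prefix)."""
    kind, sep, prefix = location.partition(":")
    if kind == "in-memory" and not sep:
        return ("in-memory", None)
    if kind == "filesystem-relative" and prefix:
        return ("filesystem-relative", prefix)
    raise ValueError(f"unrecognized resource location: {location!r}")


def resource_directory(location, executable_path):
    """Return the directory a location refers to, or None for in-memory."""
    kind, prefix = parse_location(location)
    if kind == "in-memory":
        return None
    # Assumption: the prefix is relative to the executable's own directory.
    return Path(executable_path).parent / prefix
```

Under that assumption, the directory that FileManifest output is installed into has to match resource_directory(...) for the extension modules to be found at runtime.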

@indygreg : Took a bit of a detour this morning, but made a bit of progress.

The detour was due to attempting to re-include this line from the template, thinking that it was a necessary step to end up with a "complete artifact", with a binary/exe next to a libs directory.

# Add the generated executable to our install layout in the root directory.
files.add_python_resource(".", exe)

But that attempted to spawn a subprocess to build a crate, and failed. I've since realized that this step is not actually necessary in our environment: we're building from a build.rs script, which is part of our ongoing crate build. Instead, what the build.rs script needs to consume is the output of the (default) resources target (in src/rust/engine/pyembed-gen/x86_64-apple-darwin/debug/resources).

But, I _do_ still need a FileManifest in order to expose the PythonExtensionModules that were captured via pip install.

I've got to admit that I'm struggling a bit with the target concept: targets seem to run and produce side effects under BUILD_PATH/x86_64-apple-darwin/debug/$targets, and pyembed is able to successfully consume those side effects. But I'm confused about the ordering semantics and what causes output to actually be produced. I can't seem to get output for both the FileManifest containing the extension modules _and_ the PythonEmbeddedResources object at the same time.
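For readers equally unsure about targets, here is a toy model of dependency-ordered resolution: each target is a function, its dependencies are resolved first, and their results are passed in as positional arguments. This is only a sketch of the general idea; the TargetRegistry class is hypothetical and it makes no claim about how PyOxidizer actually schedules targets or decides which outputs get written.

```python
class TargetRegistry:
    """Toy registry that resolves named targets in dependency order."""

    def __init__(self):
        self._targets = {}
        self._resolved = {}
        self.order = []  # names in the order they were actually built

    def register_target(self, name, fn, depends=()):
        self._targets[name] = (fn, tuple(depends))

    def resolve(self, name):
        # Each target is built at most once; later requests reuse the result.
        if name in self._resolved:
            return self._resolved[name]
        fn, depends = self._targets[name]
        args = [self.resolve(dep) for dep in depends]
        value = fn(*args)
        self._resolved[name] = value
        self.order.append(name)
        return value
```

In this model, resolving only one target builds that target and its transitive dependencies and nothing else, which is one possible explanation for seeing output from one target but not another. The sketch has no cycle detection.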

(I'll admit that it's a bit tempting to try to directly use the pyoxidizer rust API from a build.rs script, but that's not a suggested option on https://pyoxidizer.readthedocs.io/en/latest/rust_crate_configuration.html# ...)

The conclusion of my previous comment was to use FileManifest.install inside the target that was generating the PythonEmbeddedResources... not ideal, but workable?

The result looks like this: https://github.com/pantsbuild/pants/commits/3d01878e47cbf0fe629dcc2fc8d57903bc160486 , which is able to find the following native modules, which it attempts to load: https://gist.github.com/stuhood/a1759cd4b33588ff01c479be3b4c100c

Unfortunately, a few (but not all?) of the native modules fail to load: if I load a repl and poke at these, I see:

  • levenshtein fails with a missing symbol

    • could be ported to rust

  • pywatchman.bser loads fine!

    • has been removed in master

  • mypy fails with a missing symbol

    • doesn't actually need to be linked into our binary: is only used in pexes that we fork

  • setproctitle fails with a missing symbol

    • Challenging to remove... there isn't any portable rust code to do this currently.

  • psutil loads fine!
  • coverage fails due to missing __file__ (ie https://github.com/indygreg/PyOxidizer/issues/69 : I think that we can drop this dep, but I wonder how many other cases of this we'll encounter)

    • doesn't actually need to be linked into our binary: is only used in pexes that we fork

  • thriftpy2 fails to load because the _scproxy backend is missing (filtered out because we use extension_module_filter='no-gpl', I think)

    • Is used by pyzipkin, and could be ported to rust or removed

  • cryptography fails to load because _cffi_backend has a missing symbol

    • Is used by requests: pretty challenging to remove, and something that plugins are likely to want to consume.

```
import cryptography.hazmat.bindings._padding
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: dlopen(/Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/_cffi_backend.cpython-37m-darwin.so, 2): Symbol not found: _PyThreadState_Delete
  Referenced from: /Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/_cffi_backend.cpython-37m-darwin.so
  Expected in: flat namespace
 in /Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/_cffi_backend.cpython-37m-darwin.so
```

An example missing symbol error for levenshtein:

dlopen(/Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/Levenshtein/_levenshtein.cpython-37m-darwin.so, 2): Symbol not found: _PyUnicode_FromUnicode
  Referenced from: /Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/Levenshtein/_levenshtein.cpython-37m-darwin.so
  Expected in: flat namespace
 in /Users/stuhood/src/pants/src/rust/engine/target/debug/thirdparty_requirements/Levenshtein/_levenshtein.cpython-37m-darwin.so

@indygreg: Are the missing symbols expected, or is this a function of how I'm trying to load these?
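One way to narrow down errors like these is to ask the running interpreter process whether it exports a given C API symbol at all. The sketch below uses ctypes.pythonapi for that; it assumes ctypes is importable inside the embedded interpreter, and that the leading underscore in the dlopen message is just the macOS symbol prefix (so the C name is PyThreadState_Delete). The helper names are hypothetical.

```python
import ctypes


def symbol_available(name):
    """Return True if the running interpreter process exports the C symbol."""
    # Attribute access on ctypes.pythonapi performs a symbol lookup and
    # raises AttributeError when the symbol cannot be found.
    return hasattr(ctypes.pythonapi, name)


def missing_symbols(names):
    """Return the subset of names that the running interpreter does not export."""
    return [name for name in names if not symbol_available(name)]
```

Calling missing_symbols(["PyThreadState_Delete", "PyUnicode_FromUnicode"]) from the embedded repl would show whether the symbols are absent from the binary itself, as opposed to being present but not visible to the extension module being loaded.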

I've completed the list of problematic extension modules above. Most of them we could probably drop our dependencies on with a bit of work.

But requests (requests[security] uses cryptography, which uses cffi) is probably the lone scary one. Not being able to use code that uses cffi feels like a blocker, both because requests (which is widely used) uses it, and because cffi itself is probably widely used.

@indygreg : Are there workarounds for cffi in particular? EDIT: Dang, sorry: should have searched first: https://github.com/indygreg/PyOxidizer/issues/26 looks relevant, but https://github.com/indygreg/PyOxidizer/issues/170 still looks like a problem.

Any progress on this?

The latest code in the main branch of PyOxidizer contains a ton of changes to how resource packaging works. And I'm actively making PyOxidizer far less opinionated/catered to the single executable/binary use case. As part of this transition, I've also found and hopefully fixed a lot of paper cuts around how extension modules are handled.

Unless you need a single file executable, I'd encourage you to play around with the more traditional execution modes where you have dynamic libraries for extension modules and on-disk resources. This is how I plan on initially shipping Mercurial's Python 3 installers. Mercurial's (in-progress) pyoxidizer.bzl is the following:

ROOT = CWD + "/../.."

IS_WINDOWS = "windows" in BUILD_TARGET_TRIPLE

# Code to run in Python interpreter.
RUN_CODE = "import hgdemandimport; hgdemandimport.enable(); from mercurial import dispatch; dispatch.run()"

set_build_path(ROOT + "/build/pyoxidizer")

def make_distribution():
    return default_python_distribution()

def make_distribution_windows():
    return default_python_distribution(flavor = "standalone_dynamic")

def resource_callback(policy, resource):
    # We use a custom resource routing policy to influence where things are loaded
    # from.
    #
    # For Python modules and resources, we load from memory if they are in
    # the standard library and from the filesystem if not. This is because
    # parts of Mercurial and some 3rd party packages aren't yet compatible
    # with memory loading.
    #
    # For Python extension modules, we load from the filesystem because
    # this yields greatest compatibility.
    if type(resource) in ("PythonModuleSource", "PythonPackageResource", "PythonPackageDistributionResource"):
        if resource.is_stdlib:
            resource.add_location = "in-memory"
        else:
            resource.add_location = "filesystem-relative:lib"

    elif type(resource) == "PythonExtensionModule":
        resource.add_location = "filesystem-relative:lib"

def make_exe(dist):
    """Builds a Rust-wrapped Mercurial binary."""
    packaging_policy = dist.make_python_packaging_policy()
    # Extension may depend on any Python functionality. Include all
    # extensions.
    packaging_policy.extension_module_filter = "all"
    packaging_policy.resources_policy = "prefer-in-memory-fallback-filesystem-relative:lib"
    packaging_policy.register_resource_callback(resource_callback)

    config = PythonInterpreterConfig(
        raw_allocator = "system",
        run_eval = RUN_CODE,
        # We want to let the user load extensions from the file system
        filesystem_importer = True,
        # We need this to make resourceutil happy, since it looks for sys.frozen.
        sys_frozen = True,
        legacy_windows_stdio = True,
    )

    exe = dist.to_python_executable(
        name = "hg",
        packaging_policy = packaging_policy,
        config = config,
    )

    # Add Mercurial to resources.
    exe.add_python_resources(exe.pip_install(["--verbose", ROOT]))

    # On Windows, we install extra packages for convenience.
    if IS_WINDOWS:
        exe.add_python_resources(
            exe.pip_install(["-r", ROOT + "/contrib/packaging/requirements_win32.txt"]),
        )

    return exe

def make_manifest(dist, exe):
    m = FileManifest()
    m.add_python_resource(".", exe)

    return m

def make_embedded_resources(exe):
    return exe.to_embedded_resources()

register_target("distribution_posix", make_distribution)
register_target("distribution_windows", make_distribution_windows)

register_target("exe_posix", make_exe, depends = ["distribution_posix"])
register_target("exe_windows", make_exe, depends = ["distribution_windows"])

register_target(
    "app_posix",
    make_manifest,
    depends = ["distribution_posix", "exe_posix"],
    default = "windows" not in BUILD_TARGET_TRIPLE,
)
register_target(
    "app_windows",
    make_manifest,
    depends = ["distribution_windows", "exe_windows"],
    default = "windows" in BUILD_TARGET_TRIPLE,
)

resolve_targets()

# END OF COMMON USER-ADJUSTED SETTINGS.
#
# Everything below this is typically managed by PyOxidizer and doesn't need
# to be updated by people.

PYOXIDIZER_VERSION = "0.8.0-pre"
PYOXIDIZER_COMMIT = "2ea18328969e00b64cf91a575c680435e8deb022"

The important lines there are default_python_distribution(flavor = "standalone_dynamic"), which uses a more traditional Windows distribution with shared libraries, and the resource_callback function, which defines a custom routing policy for resources: the standard library goes to memory, while extensions and non-stdlib resources go to the filesystem. This mostly _just works_.
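The routing rule in resource_callback can be modelled in plain Python to check what ends up where. In this sketch the Resource dataclass is a hypothetical stand-in for PyOxidizer's resource objects (which exist only inside the Starlark config), and the kind field plays the role of type(resource); only the branching logic mirrors the config above.

```python
from dataclasses import dataclass
from typing import Optional

# Resource types that the config above treats as Python sources or data.
SOURCE_KINDS = (
    "PythonModuleSource",
    "PythonPackageResource",
    "PythonPackageDistributionResource",
)


@dataclass
class Resource:
    """Hypothetical stand-in for a PyOxidizer resource object."""

    kind: str
    name: str
    is_stdlib: bool = False
    add_location: Optional[str] = None


def route(resource, lib_dir="lib"):
    """Apply the same routing rule as resource_callback and return the resource."""
    if resource.kind in SOURCE_KINDS:
        if resource.is_stdlib:
            resource.add_location = "in-memory"
        else:
            resource.add_location = f"filesystem-relative:{lib_dir}"
    elif resource.kind == "PythonExtensionModule":
        resource.add_location = f"filesystem-relative:{lib_dir}"
    return resource
```

Resource kinds that match neither branch keep add_location unset here, which corresponds to the config leaving them to the packaging policy's default.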

At this point, I highly recommend running PyOxidizer from the latest commit on the main branch in source control. I try to keep main high quality, and I'm blocked from releasing an official 0.8 release due to reliance on a few Rust crates with bug fixes not yet released on crates.io.

Thanks a lot for the update! This remains a very exciting possibility, and I'll look into prioritizing it again in the next few weeks.

@ofek : To help with prioritization: would changing the distribution model of Pants be helpful for you in some way?

I'm considering using pants and came across this thread so I thought I'd bump it.

This issue, as described, represents a large change. Presumably most of the latency benefits can be gained without trying to switch the distribution model at all, but with just switching the pantsd client to rust. I've broken out this slice in #11831.
