Pants: Refactor .with_binaries() into a Target field

Created on 29 Mar 2021  路  13Comments  路  Source: pantsbuild/pants

As we no longer have to maintain backwards compatibility with 1.x, we can leave the .with_binaries() behind.

Agreed on syntax, that's an uncleaned-up holdover from providing 1.x 2.x compatibility we no longer need to maintain.

https://pantsbuild.slack.com/archives/CASMF8SJ1/p1617036654055400?thread_ts=1617029155.053700&cid=CASMF8SJ1

publishing python tech-debt user reported

Most helpful comment

Really, provides=setup_py() no longer makes any sense at all. Previously, you could add this to a python_library. In that case, it made sense, you were singling out a python_library for publication and needed to add publication data. Now that there is python_distribution - which already says "publish me" right on the tin, provides=setup_py(...) should probably go and all data should migrate to the target's direct fields.

All 13 comments

Great idea. Thoughts on what this target field should look like? Maybe this:

python_distribution(
   name="dist",
   dependencies=[":lib"],
   console_script_binaries={"my_command": "path/to:pex_binary"},
   provides=setup_py(...)
)

Some notes with this modeling:

  1. Right now, you can either use a dictionary or a kwarg. That's confusing to have two ways of doing things. This field should instead be a dict[str, str] always, even if you only have one entry. No shorthand.
  2. The name console_script_binaries is meant to be explicit what this does, and to leave the door open for integrating with other types of binaries like shiv_binary.

Thoughts? Definitely open to recommendations, including better field names.

Directions to implement

1. Add the field

Follow the guide at https://www.pantsbuild.org/v2.4/docs/target-api-new-fields to add a field to python/target_types.py. Check that it's wired correctly via ./pants help python_distribution. You will use DictStringToStringField. This field is not required.

2. Wire it up

We use .with_binaries in two places. First, a dep inference rule:

https://github.com/pantsbuild/pants/blob/0a006783f229a2e4e9ed4dfb201db478ca2904ba/src/python/pants/backend/python/target_types_rules.py#L141-L159

Second, setup.py:

https://github.com/pantsbuild/pants/blob/0a006783f229a2e4e9ed4dfb201db478ca2904ba/src/python/pants/backend/python/goals/setup_py.py#L516-L554

Both of these call-sites will have very few changes involved - all that needs to change is to not only look at PythonArtifact.binaries, but to also look at your new field. Note that we must still consider the PythonArtifact.binaries (.with_binaries()) until the deprecation is complete. We'll want to merge the two dictionaries into a single dict[str, str]. Perhaps add a method to your new Field that will take a PythonArtifact as an arg and merge the two? Then the call-sites use something like target[PythonDistributionConsoleScriptBinaries].merged_value(target[PythonProvidesField].value) method, instead of target[PythonProvidesField].value.binaries?

Update the tests for python/target_types_test.py and python/goals/setup_py_test.py to use your new field. Imo, because we're deprecating .with_binaries(), it's fine to simply replace that part of the tests with the new syntax - altho it would be even more robust to expand the tests so that both approaches are used and we confirm we merge.

3. Deprecate the old .with_binaries()

Use @deprecated from pants.base.deprecated here:

https://github.com/pantsbuild/pants/blob/0a006783f229a2e4e9ed4dfb201db478ca2904ba/src/python/pants/backend/python/macros/python_artifact.py#L43

Confirm it works by testing with src/python/pants:pants-packaged, which will actually need to be updated while you're at it.

I started thinking about the alternative new syntax, but didn't have the bandwidth for it yesterday.
The one you propose is straight forward enough, but I got an idea on this mornings walk I'd like to run by you, see what you make of it.

python_distribution(
   name="dist",
   dependencies=[":lib"],
   provides=setup_py(
      entry_points=[
         entry_point_py(type="console_script", name="my_command", entry_point="path/to:pex_binary"),
         entry_point_py(type="custom_name", name="foo", entry_point="lib.module:foo"),
      ],
      ...
   ),
)

I'm thinking, that to keep the console script binaries syntax as close as possible to any other entry point.. also, this would make it easy to support inferring dependencies the same, regardless of what kind of entry point we're dealing with.
By having the declaration in a type with its own parameters also leaves room for new features down the road, without breaking old syntax. The entry_point_py may treat the "console_script" type specifically, and any other types as just generic dummies, passing the information along to the generated setup.py (while still inferring dependencies, though).

Also, I wonder why you placed the console_script_binaries directly on python_distribution rather than on the setup_py?
The reason for it that I come up with would be if there may be other distribution formats other than setup_py, but that also could use console_scripts.. but then again, there are other fields too, like version etc that fits the same description, so I still don't find what the argument could be for having it outside the setup_py parameter.

Interesting idea to combine console scripts and other types of entry points!

--

I don't think we want to introduce entry_point_py, which would mean introducing another magical symbol. Generally, setup_py is using legacy mechanisms and doesn't integrate nicely with things like ./pants help. That does not mean the overall idea is bad, only using a new special symbol.

wonder why you placed the console_script_binaries directly on python_distribution rather than on the setup_py?

Indeed. My thinking is that setup_py should be seen as roughly a 1-1 mapping between what goes in the BUILD file and what gets passed to the setup() function in setup.py. Pants should not be doing many magical things from the setup_py() contents, and magic should be done via normal fields that can have a help message.

--

We could possibly have an entry_points field on the PythonDistribution? The syntax could mirror pex_binary approaches 1 and 2: https://www.pantsbuild.org/docs/python-package-goal#the-entry_point-field

I think I like that modeling for the sake of https://github.com/pantsbuild/pants/issues/11807, that it makes it more explicit we'll attempt dep inference on the entry_points, rather than us extracting that out of the setup_py. If we do that, we could deprecate setting entry_points directly in the setup_py and eventually ban it, perhaps.

But the tricky part imo is how to model that in a field, as it's really a dict[str, dict[str, str]]. We can support that with the Target API, only I'm wondering if it's too complex?

python_distribution(
   name="dist",
   entry_points={"console_scripts": {"my_cmd": "path/to:tgt"}},
   ...
)

Maybe it's okay? I don't hate it, and I think I like it more than putting it on setup_py(). I also think that I like it de-emphasizing console_scripts as being something special.

Yeah, OK, now I see a valid point for having the entry point stuff outside provides, as we pull dependency information from it.

I like your example with entry_points on the python distribution. The plus side is that the syntax is the same as the one one is already familiar with from the one we already use in setup.py files (with the addition to support pants address to pex binary targets)

Really, provides=setup_py() no longer makes any sense at all. Previously, you could add this to a python_library. In that case, it made sense, you were singling out a python_library for publication and needed to add publication data. Now that there is python_distribution - which already says "publish me" right on the tin, provides=setup_py(...) should probably go and all data should migrate to the target's direct fields.

I've been working on updated instructions based on the entry_points proposal, but one thing is tripping me up: how to disambiguate between addresses vs. modules.

I think this is the objective now:

python_distribution(
   name="dist",
   dependencies=[":lib"],
   entry_points={"console_scripts": {"my_command": "path/to:pex_binary"}, "gui_scripts": {"gui": "path.to.module:func"}},
   provides=setup_py(...)
)

Maybe I'm overthinking. I think we can use the heuristic that if / is in the str, treat it as an address. Target addresses are always a path; when they're at the top-level, they should be prefixed by //. We do need to afford the syntax :sibling_tgt, but we can special case that. Otherwise, assume it's a module.

We will continue to enforce that the address is a pex_binary target with an entry_point with the :func specified. Modules must end in :my_func.

From an end-user perspective, would you find it confusing that we allow for either addresses to pex_binary targets _or_ Python modules?

--

In fact, my gut says that we should stop having the pex_binary affordance. All we ever do is extract the entry_point field from it:

https://github.com/pantsbuild/pants/blob/6db9355ebc77853f02bc6aded82151375d113f9c/src/python/pants/backend/python/goals/setup_py.py#L516-L554

That is a whole lot of complexity鈥攊n terms of code and documentation鈥攐nly to save the user from typing my_module:func both in the pex_binary and python_distribution. I think that this connection with pex_binary is probably somewhat legacy to the day when pex_binary could have a sources field.

I suspect that we should have entry_points solely use the path.to.module:func format. We can easily do dep inference on that.

What do you think @kaos?

Yeah, well, it is not such a big win to have it in, so if it adds a lot of complexity, I think it could go.

However, regarding heuristics, I like a pragmatic approach of just trying to resolve it as a target address first, always. If it hits, use it, else treat it as a python module reference. How likely is it that you will have conflicts here?

But again, maybe not that useful to warrant the potential confusion. Although, come to think of the implications when targeting a pex_binary (or other, in the future) is the added value we can extract from that. Now, we only use the entry_point, not such a big win in and by itself, but, what if we also add the dependencies of that pex binary, which could be a lot more than just the dependencies we get from the entry point itself. And the entry point is not that much duplication to get rid of, but the list of dependencies are, I'd say.

One comment on the confluence of "magical symbol" and difficulty parsing strings - these go hand in hand.

Stepping slightly back from the case in hand where a pex_binary address serves as an alias for an entry point specification and nothing more to the more general case where an address might point to much more hand-maintained metadata: We have a persistent problem addressed in an ad-hoc manner currently parsing string fields.

One influence is a desire to limit BUILD syntax to target types (constructors only) and python primitive types. This influence forbids the general technique of introducing new types for disambiguation like we do all the time in rule code for example. So we're not allowing address("module:func") where module is a top-level directory to disambiguate from "module:func"; instead we're forced to do some extra logic to see if module is a top-level directory, and if so either warn to use "//..." to disambiguate and fail or else adopt a heuristic of address wins in ambiguous situations. Clearly we _could_ make extra constructor-only types beyond target types well supported and exactly as magical as target type constructors themselves with work to integrate these into help and docs; so "magical" is not a helpful term afaict except to point out a current happenstance limitation that's just a question of effort.

To be clear, I'm not suggesting address(...) or entry_point_py(...) are good ideas, I just mean to separate out fundamental problems from happenstance ones. I think running into strings parsing disambiguation problems is fundamental. Solving this general problem in a non ad-hoc way with non-target struct-like types is a possibility that is probably only unattractive as a happenstance.

I like a pragmatic approach of just trying to resolve it as a target address first, always. If it hits, use it, else treat it as a python module reference. How likely is it that you will have conflicts here?

At the moment, our code fails when you try to resolve an Address that does not exist. This is in large part due to the introduction of _file targets_ in 2.0, vs. the more traditional _BUILD targets_. We have to determine which type the user is using:

https://github.com/pantsbuild/pants/blob/0df82888546ec8a86ca977bed4897c1b017be9ff/src/python/pants/engine/internals/build_files.py#L48-L69

So, we can't default to treating things as Address. I don't think this is something we can really refactor our way out of: there is a fundamental ambiguity between file targets and BUILD targets that we must consult the file system to disambiguate.

--

but, what if we also add the dependencies of that pex binary, which could be a lot more than just the dependencies we get from the entry point itself. And the entry point is not that much duplication to get rid of, but the list of dependencies are, I'd say.

Ah, that's interesting because that is the case now that we infer a dependency on the pex_binary for .with_binaries(), which means that we get all the transitive deps too. We would be losing that. However, I wonder how common it is to put your dependencies on the pex_binary rather than, say, a python_library (legit question, not rhetorical). Now that pex_binary does not have a sources field, my thinking is that most people would put dependencies on the python_library because it describes what the actual code needs:

For example, for the sake of argument, main.py depends on NumPy but it's not inferrable. I would expect people to put that dep on the python_library, which would be necessary so that things like Pylint and Pytest include the dep. It's the source file that has the dep.

python_library(name="lib", dependencies=["//:numpy"])
pex_binary(name="pex", entry_point="main.py")

If people are putting their deps on the underlying python_library rather than pex_binary, then there's no issue.

If it's still the case that there's lots of duplication, people can factor it out with a generic target() target type: https://www.pantsbuild.org/docs/reference-target.

--

I think one of the reasons I'm leaning more and more against special-casing pex_binary addresses is that I fear it suggests we're doing more than we actually are: inferring a dep + extracting the entry_point field. We have two other fields where depending on a pex_binary actually _creates_ the Pex binary: python_tests.runtime_package_dependencies and archive.packages - we do not create the PEX here, and imo that's inconsistent.

--

@kaos Thoughts? I truly really appreciate your perspective here. @jsirois does a great job reminding us maintainers that targets are generally a fairly foreign concept and it's easy to forget that the more time you spend with Pants.

Sounds reasonable. And as you say, it makes sense to have the dependencies deeper, and not directly at the pex binary. With that, I see little reason to keep the support for referencing pex binary targets from console entry points.

I have to admit, I didn't fully grasp all of that what John wrote.

Great! Thanks for proposing that design and talking it through! Updated instructions:

Our goal is something like this:

python_distribution(
   name="dist",
   dependencies=[":lib"],
   entry_points={"console_scripts": {"my_command": "path.to.module:func"}, "gui_scripts": {"gui": "path.to.module:func"}},
   provides=setup_py(...)
)

The entry points need to take the form of path.to.module:func.

1. Add the field

Follow the guide at https://www.pantsbuild.org/v2.4/docs/target-api-new-fields to add a field to python/target_types.py. You will use Field and implement compute_value() to have the type hint Optional[Dict[str, Dict[str, str]]] for raw_value. It'll look similar to this:
https://github.com/pantsbuild/pants/blob/2b412dbfe1b803f8502cda29b6dce46690137592/src/python/pants/engine/target.py#L1157-L1175

I think we will want to use the EntryPoint class already defined for the sake of pex_binary, which offers a really convenient representation:

https://github.com/pantsbuild/pants/blob/bb3fa05d4b32512eb8674d6dbfacb5f07e1f8596/src/python/pants/backend/python/target_types.py#L123-L126

You can then hook it up to the Field like this:

https://github.com/pantsbuild/pants/blob/bb3fa05d4b32512eb8674d6dbfacb5f07e1f8596/src/python/pants/backend/python/target_types.py#L188-L209

Note how the raw_value kwarg shows what the user puts into the BUILD file, but the return type for compute_value() and the value field both show the normalized data type. I think you'll want those to both be FrozenDict[str, FrozenDict[str, EntryPoint]] (FrozenDict so that the Field is hashable).

Further, I think you'll want to validate that every EntryPoint.function is defined, which is required by setuptools iiuc? https://setuptools.readthedocs.io/en/latest/userguide/entry_point.html#console-scripts
Raise InvalidFieldException if not the case.

Check that it's wired correctly via ./pants help python_distribution.

2. Wire up to setup_py.py

The main place to update is here:

https://github.com/pantsbuild/pants/blob/bb3fa05d4b32512eb8674d6dbfacb5f07e1f8596/src/python/pants/backend/python/goals/setup_py.py#L490-L550

Note that we still need to handle:

  • .with_binaries()
  • The user defining entry_points already. We need to avoid overwriting that and instead update it.

    • I think the current code already handles this the right way.

    • We should always allow users to still define via setup_py() imo. One reason is so that the setup_kwargs plugin hook we have can still dynamically add entry points. The only downside to doing it this way is that dep inference won't work.

From my quick glance at

https://github.com/pantsbuild/pants/blob/bb3fa05d4b32512eb8674d6dbfacb5f07e1f8596/src/python/pants/backend/python/goals/setup_py.py#L546-L550

Move those first two lines out of the for loop. Use things like .extend() to update things. Be careful dealing with a dict[str, dict[str, str]] - it's easy to accidentally replace the inner dictionary instead of appending to it.

It'd be great to update the test to ensure we're doing this the right way - define an entry point both via the new Field and via setup_py(), and check that we merge correctly.

https://github.com/pantsbuild/pants/blob/bb3fa05d4b32512eb8674d6dbfacb5f07e1f8596/src/python/pants/backend/python/goals/setup_py_test.py#L134

3. Add dependency inference

Refer to https://github.com/pantsbuild/pants/issues/11807#issuecomment-809697859, but you can ignore everything about await Get(SetupKwargs, ExportedTarget). You are simply going to look at the target.get(PythonDistributionsEntryPoints).value.

4. Deprecate .with_binaries()

Same as https://github.com/pantsbuild/pants/issues/11808#issuecomment-809661455. In the deprecation message, mention the differences, like setting {"console_scripts": ...} now; that you have to explicitly provide the module name rather than it using the pex_binary; and that Pants will infer a dependency on the owner of the module (usually a python_library or python_requirement_library) rather than the pex_binary, and to run ./pants dependencies --transitive path/to:python_distribution before and after.

Further, I think you'll want to validate that every EntryPoint.function is defined, which is required by setuptools iiuc?

No, only <module> is required.

The syntax for entry points is specified as follows:
<name> = [<package>.[<subpackage>.]]<module>[:<object>.<object>]

4. Deprecate .with_binaries()

Is a deprecation notice such as this OK?

  <string>:4: DeprecationWarning: DEPRECATED: pants.backend.python.macros.python_artifact.with_binaries will be removed in version 2.6.0dev0.
    Use `python_distribution(entry_points={...})` instead of
  `python_distribution(provides=setup_py().with_binaries(...))`.

  The syntax for entry points must follow that of setuptools, and is specified as
  follows:

      <name> = [<package>.[<subpackage>.]]<module>[:<object>.<object>]

  Example migration, before:

      pex_binary(name="my_library_bin", entry_point="my.library.bin:main")

      python_distribution(
          provides=setup_py(...).with_binaries({'my_command': ':my_library_bin'})
      )

  after:

      python_distribution(
          provides=setup_py(...),
          entry_points={
              'console_scripts': {
                  'my_command': 'my.library.bin:main'
              }
          }
      )

  The entry point must now be provided explicitly and are not derived from a
  `pex_binary` target.

  Pants will infer a dependency on the owner of the entry point module (usually a
  `python_library` or `python_requirement_library`).

  Please run the following command before and after migrating from
  `.with_binaries()` to verify that the correct dependencies are inferred.

      ./pants dependencies --transitive path/to:python_distribution

-- Docs: https://docs.pytest.org/en/stable/warnings.html
Was this page helpful?
0 / 5 - 0 ratings