Pants: Are generated subtargets a misfeature?

Created on 22 Jul 2020  路  14Comments  路  Source: pantsbuild/pants

What are generated subtargets?

When possible, Pants will generate multiple smaller targets from one original target. Each generated target will have exactly one sources file.

For example, if a target owns 5 files, Pants would generate 5 new targets. All of the metadata gets copied from the original, except for the address and the sources field.

If a file has multiple owning targets, then we will never attempt to generate a subtarget because it is ambiguous where to copy the metadata from.

When are they used?

  1. File arguments: when safe, we will generate a subtarget for each file specified. Otherwise, we use the original.
  2. --changed-since: this is essentially sugar for file arguments. Same semantics as file arguments.
  3. Explicit file dependencies in BUILD files.

    • We error if there are multiple owners.

  4. Dependency inference: when safe, we generate a subtarget for the specific file that's imported.

Open questions

Dependencies on files in the same owner

Imagine you have the target :utils. strutil.py depends on dirutil.py. In the BUILD file, :utils does not include a dependency on dirutil.py because that is already in the sources field; Dependency inference is also disabled. Then, ./pants dependencies strutil.py will end up not including dirutil.py nor :utils, and we won鈥檛 include dirutil.py anywhere in the closure.

No mechanism to disable generated subtargets

The only way to get the feature to not work is that you have multiple owning targets.

We could possibly add a global flag.

Pros

  1. If you were not already using one target per file, then you will get finer-grained caching without any new BUILD file boilerplate.

    • Finer-grained caching means that we invalidate less frequently, i.e. that we get more cache hits.

    • Finer-grained caching also results in slightly smaller chroots, as we don't copy over as many files.



      • Is this actually noticeable?



  2. File arguments (almost always) operate at the file level.

    • Previously, file arguments were really sugar to operate on targets. We only operated at the file level with lint, fmt, and test due to special logic.

    • Arguably, this makes file arguments more consistent and more useful, as something like ./pants dependencies foo.py will only show the deps of that file (if dep inference is enabled).

  3. Artificial cycles at the target level are less common.

    • This requires that you use dependency inference, however.

    • We agreed that we must tolerate cycles, at least upon request, so this is no longer a major win.

  4. Starts decoupling describing metadata from the dependency graph.

    • There is a tension between wanting to describe metadata coarsely but also wanting fine-grained dependencies. This feature allows you to get both.

Cons

  1. File addresses are deceptive.

    • We output a file name for generated subtargets, but it's not really a file. It's a target that owns exactly one file and was generated from some other target.

    • This nuance risks confusion; it hides key information about how Pants works, i.e. that Pants cares about the metatadata for that target, not only the source file.

  2. We now must teach two concepts, rather than one.

    • If you have any files(), resources(), or python_requirement_library() targets, you will always have conventional targets.

    • Because it's so common to use those target types, we will still need to teach conventional targets.

  3. A user's experience varies substantially depending on if they use file arguments or address arguments, and dep inference or not.

    • This has a risk for confusion why something works one way, but not another.

    • The tool feels less unified and less consistent.

    • See https://github.com/pantsbuild/pants/pull/10420#issue-454768171 for a concrete issue we had in Toolchain because of this. If we use file args, Pants works; otherwise, it errors.

  4. No mechanism to determine the owner of a file.

    • Before, you would use ./pants list foo.py. Now this gives you back the file name foo.py.

    • It's not clear what metadata is being applied for a file, and where to go to edit that metadata.

  5. Substantially increases our code complexity and maintenance burden.

    • This is particularly tricky code to reason about, particularly because there are so many edge cases.

    • Example: this has interfered with adding --query, where it's unclear how we should handle generated subtargets.

    • Example: confusing how to integrate this with dependees https://github.com/pantsbuild/pants/issues/10354

question

All 14 comments

Thanks for opening! I'll take a look tomorrow morning.

cc @cosmicexplorer, I'm curious your thoughts as you've had to interact with this recently with your --query work.

  1. File addresses are deceptive.

    • We output a file name for generated subtargets, but it's not really a file. It's a target that owns exactly one file and was generated from some other target.

    • This nuance risks confusion; it hides key information about how Pants works, i.e. that Pants cares about the metatadata for that target, not only the source file.

There is a path forward for this issue described in the design doc by making the syntax non-ambiguous: ie, applying the @ to select the owner. This works both as a Spec input and as an Address output.

  1. We now must teach two concepts, rather than one.

    • If you have any files(), resources(), or python_requirement_library() targets, you will always have conventional targets.

    • Because it's so common to use those target types, we will still need to teach conventional targets.

This doesn't feel like a second concept... afaict those target types would work the same as any others. Inference is never going to choose to depend on just one file from one of these targets (if it ever figured out how to infer a dependency on those files), but if a human did, it would be perfectly fine to depend on just one file explicitly.

  1. A user's experience varies substantially depending on if they use file arguments or address arguments, and dep inference or not.

    • This has a risk for confusion why something works one way, but not another.

    • The tool feels less unified and less consistent.

    • See https://github.com/pantsbuild/pants/pull/10420#issue-454768171 for a concrete issue we had in Toolchain because of this. If we use file args, Pants works; otherwise, it errors.

As mentioned on that ticket, this general class of issues feels like it could be fixed by adjusting the generation of file targets. The generation of both the base target and the file targets needs to be deterministic and consistent, regardless of which of those a caller is depending on.

With regard to the difference between ./pants test :: and ./pants test '**/*': this was acknowledged during the design session, with consensus that it was reasonable for those to have different effects. AFAICT, it continues to be simplest to treat "address Specs" as expanding to targets, and "file Specs" as expanding to files.

  1. No mechanism to determine the owner of a file.

    • Before, you would use ./pants list foo.py. Now this gives you back the file name foo.py.

    • It's not clear what metadata is being applied for a file, and where to go to edit that metadata.

  2. Substantially increases our code complexity and maintenance burden.

    • This is particularly tricky code to reason about, particularly because there are so many edge cases.

    • Example: this has interfered with adding --query, where it's unclear how we should handle generated subtargets.

    • Example: confusing how to integrate this with dependees https://github.com/pantsbuild/pants/issues/10354

Both of these issues can be resolved by a model change that I have suggested: both files and targets should be nodes in the dependency graph, and the graph introspection goals should gain filtering options (or we should do that via query). The effect of both files and targets being nodes in the graph would be that:

  1. To find targets owning a file, you'd use dependees --direct --targets (ie, with a filter to render only targets).
  2. Likewise, dependees would automatically work for either files or targets (or both, with no filter specified).
  3. filedeps would merge with dependencies via an identical filtering option on output.
  4. query (assuming it lands) would be able to operate on both files and targets.

In my mind, this model change falls into the camp of "Pros"... right now our graph introspection goals have gaps and are bifurcated due to the fact that they each operate on either files or targets, but not both. It would also make file deps inherent: ie, a target always depends on its files, regardless of whether inference is used.


I'll spend some time this morning looking concretely at what API changes would be involved in the model change.

These are my notes based on a convo I just had with Eric.

Stepping back, I'd like to distinguish two concepts, for extra clarity:

A) File-level dependencies.
B) File-level subtargets.

A) is a feature - it is what we want to achieve. B) is "merely" an implementation decision - it was our chosen way to implement A). We chose B) because it would allow downstream rules to work as usual, since they already operate on targets.

However it appears that the implementation is over-complex, and that complexity is leaking out in user-facing ways. "Downstream rules can work as usual" was too naive an assumption.

So, I'd like to propose an alternative implementation of A). One that does require downstream rules to change, but is conceptually simpler, and requires a lot fewer changes elsewhere.

Basically, instead of the basic 'currency' of operation being "list of targets", it would be "list of (source_path, target) pairs".
This would require a bunch of changes in various rules, yes. But it is easy to go from "list of targets" to "list of (source_path, target) pairs" by enumerating the source paths in the target. (*) And it's obviously even easier to go the other way.

So, for example, when a rule needs the transitive dep closure, it requests it for a list of (source_path, target) pairs and gets back a bigger list of (source_path, target) pairs. This is where file-level deps (and dep inference) could kick in.

Basically, if we're saying that "a target is metadata attached to a file" then we probably need to model that fact explicitly in the rules that care, rather than trying to emulate that via these up-front generated subtargets.

I haven't thought through this enough to get a sense of the problems that might ensue, but wanted to see what y'all thought.

(*) Eric mentioned that an intrinsic to evaluate globs without snapshotting the contents would make this more performant.

I haven't thought through this enough to get a sense of the problems that might ensue, but wanted to see what y'all thought.

This is nearly how things work at HEAD: Address now has an optional field that indicates which "base target" it is associated with. So in practice, Address today acts roughly like a union of:

Union[metadata_source, Tuple[File, metadata_source]]

But yes, @benjyw : I think that what you and I are suggesting is fairly similar. The effect of the model change I mentioned would be that Target was roughly:

Target[metadata_source, Optional[File]]

... ie, that every target would own exactly one or zero files, and that when we generated "subtargets", the effect would be that the "base" target owned no files, and instead just depended on its subtargets/file-targets.

The difference between these two models is that I think that you still want to have "named/base target" nodes in the graph (even if they don't "own" files). They exist, people can depend on them (and probably need to in the case of resources), and graph introspection goals should optionally show them.

EDIT: My types were not aligning. The change I'm suggesting has more to do with Targets than addresses, afaict.

Glad we're converging on something here. My main point is that we should never generate a Target. Those should only represent actual BUILD file entities. In my (source file, target) pair scheme, what I really mean is (source file, metadata), where metadata is represented by a target, or - possibly - by some Metadata type derived from a target (even if all that Metadata type is is a copy of the target's fields).

And for example if there is no target at all owning a file that we inferred a dep on, we can just stick in some empty/default metadata. I think.

I think Benjy's proposal is great. It's an extension and improvement on the idea of AddressWithOrigin. There, we are able to use that WithOrigin part to sometimes do things more precisely, like with test and fmt/lint. Here, we can have something like AddressWithSource, which is a pair of (Address, Optional[str]).

If you use an address arg or traditional dependency, then you will depend on every single AddressWithSource that belongs to the target; if you use dependency inference or file arguments, then we can only use the AddressWithSources that are relevant.

Given an AddressWithSource, it's easy to go to Address. This means that some rules that don't care about files can continue to take Addresses as their input, e.g. list.

This allows us to differentiate clearly between what is truly a target vs. what is a file. Currently, ./pants test outputs "Running on target foo.py" for generated subtargets, but "Running on target :foo" if dealing with a normal address. This is a leak to the end user. Instead, we would be able to always choose to output the file name or the address name, like "Running on file foo.py" or "Running on target :foo". Regardless of if the user used file args vs. address args, we have access to both the address and the file. There is no weird mixing of the two ideas.

--

This is very feasible to implement, and we know that because of the implementation of AddressWithOrigin.

Also, I think we could reimplement rules to use this concept with no behavior changes at all relative to the "classic" behavior (pre dep inference). Then introducing file-level dep inference on-top-of/after reworking the rules should be straightforward.

Again, I think...

Also, I think we could reimplement rules to use this concept with no behavior changes at all relative to the "classic" behavior (pre dep inference). Then introducing file-level dep inference on-top-of/after reworking the rules should be straightforward.

Yes, I think so. I think I have a pretty good idea of how to implement this.

Cool... I think that at some level these changes are pretty similar.

This allows us to differentiate clearly between what is truly a target vs. what is a file. Currently, ./pants test outputs "Running on target foo.py" for generated subtargets, but "Running on target :foo" if dealing with a normal address. This is a leak to the end user. Instead, we would be able to always choose to output the file name or the address name, like "Running on file foo.py" or "Running on target :foo". Regardless of if the user used file args vs. address args, we have access to both the address and the file. There is no weird mixing of the two ideas.

AFAICT, it would already be possible to render things this way at HEAD. An Address with a base_target_name is a "file", always, and an address without one is a "target", always. It's possible that AddressWithOrigin is a more typesafe way to represent this difference though, because if you want the owning target, you request an Address, but otherwise you request an AddressWithOrigin?

I'll maintain though that I think changes to Target would further simplify this.

It's possible that AddressWithOrigin is a more typesafe way to represent this difference though, because if you want the owning target, you request an Address, but otherwise you request an AddressWithOrigin?

Yeah, I think so. We could figure out how to get this, but it requires plugin authors being fluent with the property .generated_base_target_name. This is a much more natural way to get the original benefits that prompted generated subtargets.

Also, nit: s/AddressWithOrigin/AddressWithSource. I'm not sure what should happen to WithOrigin. We only use it right now to support precise file arguments, and that is now superseded by WithSource.

Are we happier with this now? Would like to close if we think @stuhood's work has helped allay concerns.

Not all resolved, but we decided to lean more into the experiment to see if that will address the concerns. It's possible that we still end up reverting file dependencies. If we decide that, we can re-open this as relevant.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

stuhood picture stuhood  路  5Comments

jyggen picture jyggen  路  4Comments

cosmicexplorer picture cosmicexplorer  路  8Comments

adabuleanu picture adabuleanu  路  6Comments

pvcnt picture pvcnt  路  3Comments