I have a project structure along the lines of this:
packages/
package_a/
BUILD
__init__.py
package_a.py
package_b/
BUILD
__init__.py
package_b.py
protos/
package_a/
BUILD
foobar.proto
service_a/
BUILD
bazqux.proto
services/
service_a/
BUILD
__init__.py
main.py
service_b/
BUILD
__init__.py
main.py
/packages, /protos and /services are all configured as source roots. In each BUILD files residing in protos/* I have either the line protobuf_library(python_source_root='services') or protobuf_library(python_source_root='packages'), which means that the generated code ends up in packages/* or services/* instead.
This is working just fine as long as the running code is in the same source root as the generated protobuf code, but when code in services/ is dependent on protos that has python_source_root set to packages, Python can't find the module unless an actual module from the same source root is also a dependency. I did some digging around, and it seems like the issue is that the source root specified in python_source_root is never explicitly added to Python's syspath, which is why imports fail if no "real" packages from the same source roots are used. So using the same example as earlier I see services and protos, but packages, where the generated code is placed, is missing.
I created a proof-of-concept repository in case my rambling makes little sense. The issue can be seen by running ./pants test services/::.
Interesting, this does make sense! This is the relevant code:
Which maps back to:
Based on this code, we would no matter what be using /protos in the PEX_EXTRA_SYS_PATH (PYTHONPATH) because that is what Get(SourceRoot, SourceRootRequest, SourceRootRequest.for_target(tgt)) evaluates to for the original protobuf_library.
That is, our code is not safe for codegen, which is allowed to place the generated files in any arbitrary location, unlike a normal sources field having to live in the same directory or a subdirectory of the target definition. Similarly tricky, it is legal for codegen to split out its sources into different locations, like generating both dir1/f.ext and dir2/f.ext at the same time.
--
The fix is that rather than using SourceRootRequest.for_target(tgt), which is based on the BUILD file of the original target, we need to use SourceRootRequest.for_file() with codegen:
However, we do still need to only be getting source roots for ResourcesSources and PythonSources, or targets that can generate those. We can't simply iterate over every file in sources because it might include things like files() targets, where it would be a bug to set the PYTHONPATH to include those. So, we need to make that list comprehension in python_sources.py fancier..
Right now, this is a convenience that merges all the sources into a single digest. (Refer to https://www.pantsbuild.org/docs/rules-api-and-target-api#the-sources-field and https://www.pantsbuild.org/docs/rules-api-file-system#core-abstractions-digest-and-snapshot):
We don't want that convenience, we need to associate/zip each target with its specific sources. So, we'd replace that above with all_hydrated_sources = await MultiGet(Get(HydratedSources, HydrateSources(..)) for tgt in request.targets), which will give a sequence of HydratedSoures objects. Use the same kwargs that we're using right now with SourceFilesRequest.
Then, replace the below with more complex logic: zip() the request.targets and the all_hydrated_sources. If tgt.has_field(PythonSources) or tgt.has_field(ResourcesSources), use SourceRootRequest.for_target(tgt) like before. Otherwise, if tgt.get(Sources).can_generate(PythonSources, ..) or can generate ResourcesSources, iterate over all of those targets source files in HydratedSources.files and use SourceRootRequest.for_file(f). This probably wouldn't work in a comprehension anymore, but you can build up a list of Get objects using a for loop and then say await MultiGet(source_root_gets).
Finally, updating the test. Add on to this test with a protobuf_library() that sets python_source_root:
Would you be interested in fixing this @jyggen?
Sure, I can give it a try!
I gave it a shot in this commit, but something isn't quite right because Pants fills the screen with "No source of dependency..." and refuses to start. Any pointers in the right direction would be appreciated!
Huh, I reproduce that extremely confusing rule graph error and I'm not sure why it's happening..sorry for the confusion! Rule graph error messages are a high priority for us to fix.
In the process, though, I realized my suggested approach wasn't fully worked through - I didn't realize that PythonSourceFiles must still have SourceFiles as a field. Instead, I realized a better approach is to keep the original SourceFiles code, but re-hydrate the codegen sources. (The original hydration will have been memoized, so there's little perf impact to this). This diff is working for me:
diff --git a/src/python/pants/backend/python/util_rules/python_sources.py b/src/python/pants/backend/python/util_rules/python_sources.py
index 1d668d7d6..cd891527a 100644
--- a/src/python/pants/backend/python/util_rules/python_sources.py
+++ b/src/python/pants/backend/python/util_rules/python_sources.py
@@ -13,7 +13,7 @@ from pants.core.util_rules.source_files import SourceFiles, SourceFilesRequest
from pants.core.util_rules.stripped_source_files import StrippedSourceFiles
from pants.engine.fs import MergeDigests, Snapshot
from pants.engine.rules import Get, MultiGet, collect_rules, rule
-from pants.engine.target import Sources, Target
+from pants.engine.target import HydratedSources, HydrateSourcesRequest, Sources, Target
from pants.engine.unions import UnionMembership
from pants.source.source_root import SourceRoot, SourceRootRequest
from pants.util.logging import LogLevel
@@ -98,15 +98,39 @@ async def prepare_python_sources(
MergeDigests((sources.snapshot.digest, missing_init_files.snapshot.digest)),
)
- source_root_objs = await MultiGet(
- Get(SourceRoot, SourceRootRequest, SourceRootRequest.for_target(tgt))
- for tgt in request.targets
- if (
- tgt.has_field(PythonSources)
- or tgt.has_field(ResourcesSources)
- or tgt.get(Sources).can_generate(PythonSources, union_membership)
- or tgt.get(Sources).can_generate(ResourcesSources, union_membership)
+ # Codegen is able to generate code in any arbitrary location, unlike sources normally being
+ # rooted under the target definition. To determine source roots for these generated files, we
+ # cannot use the normal `SourceRootRequest.for_target()` and we instead must determine the
+ # source file for every individual generated file. So, we re-resolve the codegen sources here.
+ python_and_resources_targets = []
+ codegen_targets = []
+ for tgt in request.targets:
+ if tgt.has_field(PythonSources) or tgt.has_field(ResourcesSources):
+ python_and_resources_targets.append(tgt)
+ elif tgt.get(Sources).can_generate(PythonSources, union_membership) or tgt.get(
+ Sources
+ ).can_generate(ResourcesSources, union_membership):
+ codegen_targets.append(tgt)
+ codegen_sources = await MultiGet(
+ Get(
+ HydratedSources,
+ HydrateSourcesRequest(
+ tgt.get(Sources), for_sources_types=request.valid_sources_types, enable_codegen=True
+ ),
+ )
+ for tgt in codegen_targets
+ )
+ source_root_requests = [
+ *(SourceRootRequest.for_target(tgt) for tgt in python_and_resources_targets),
+ *(
+ SourceRootRequest.for_file(f)
+ for sources in codegen_sources
+ for f in sources.snapshot.files
)
+ ]
+
+ source_root_objs = await MultiGet(
+ Get(SourceRoot, SourceRootRequest, req) for req in source_root_requests
)
source_root_paths = {source_root_obj.path for source_root_obj in source_root_objs}
return PythonSourceFiles(
The only thing missing now is expanding the tests in python_sources_test to ensure the edge case you identified works. Would you be interested in applying the above diff and updating that test? I'm sorry about the confusing rule graph error and the wrong instructions!