Bazel: cquery produces non-actionable warning related to toolchains

Created on 22 Aug 2020 · 26Comments · Source: bazelbuild/bazel

Description of the problem / feature request:

there is a target Foo which depends on some toolchain;
--incompatible_override_toolchain_transition is in effect;
3 command is bazel cquery deps(Foo) THENWARNING: Targets were missing from graph` is generated.

The small repro with Python toolchain is published here
We get exactly the same problem with C++ toolchain as well.

What operating system are you running Bazel on?

Windows, Linux, OSX.

What's the output of `bazel info release`?

3.4.1

P2 team-Configurability feature request

Source

konste

Most helpful comment

Okay, I have a clean-looking fix, but it requires a decent bit of cleanup. I'll send out the stages for review over the next week or so (given the break for US holidays this week).

katre on 24 Nov 2020

🚀1 🎉1

All 26 comments

This means that something referred to that target, but it hadn't actually been analyzed as part of the build (in this case, '//:tab_python_runtime_pairis discovered as thetoolchainattribute of//:tab_python_toolchain, but sincetoolchain` is a no-dependency label, there was no call to actually analyze it).

What is your preference? Do not show the message? Make it clearer what is happening?

katre on 31 Aug 2020

Adding @juliexxia as a cquery expert while we determine the right course of action.

katre on 31 Aug 2020

Frankly I still don't quite understand what causes the warning. And the fact that the warning shows up only when --incompatible_override_toolchain_transition is in effect makes it even more confusing.
In this case warning does not look actionable, like - what we supposed to fix to make the warning go away?

konste on 31 Aug 2020

Looks like, without --incompatible_override_toolchain_transition, cquery sees the expected dependency on the toolchain //:tab_python_runtime_pair (in the host configuration), which in turn depends on @python_linux//:tab_python_runtime (also in the host configuration).

With --incompatible_override_toolchain_transition, the toolchain is still //:tab_python_runtime_pair, but in the target configuration, and there is no dependency to @python_linux, because //:tab_python_runtime_pair isn't being fully analyzed.

If I try to build //:main, I get an error:

ERROR: /usr/local/google/home/jcater/repos/cquery_targets_missing_repro/BUILD.bazel:16:10: Couldn't build runfiles for //:main: //:main: missing input file 'external/python_linux/python', owner: '@python_linux//:python'

This may be due to me having to complete the BUILD files for linux (your repro only has Windows paths). Based on the Windows config, I assumed that the correct URL for linux would be https://www.python.org/ftp/python/3.8.5/Python-3.8.5.tgz, is this correct?

katre on 31 Aug 2020

It is not the correct path and I don't see the correct one there. Need 30 min to make Linux compatible repro.

konste on 31 Aug 2020

Thanks!

katre on 31 Aug 2020

re: this specific warning in the query environment.

It's difficult to be totally sure without running the full repro, but the error is coming from here: https://github.com/bazelbuild/bazel/blob/487d0b25213e723339f4f67143f5554e67ffcb3e/src/main/java/com/google/devtools/build/lib/query2/PostAnalysisQueryEnvironment.java#L260 which seems to indicate that there's a mismatch between what targets the query engine expects to be in the graph and what targets are actually there (which we already knew).

We ran into a similar issue a while back with mysteriously not-there garden variety (not-toolchain) nodes in the cquery graph which may be related? https://github.com/bazelbuild/bazel/blob/487d0b25213e723339f4f67143f5554e67ffcb3e/src/main/java/com/google/devtools/build/lib/query2/PostAnalysisQueryEnvironment.java#L424

Basically, I think that while trying to get all dependencies of a particular target in the graph, if any calculated dependency does not fall into any cases in this if/else, this warning will print. Not totally sure without a repro though.

juliexxia on 31 Aug 2020

Repro updated.
Command to run is bazel cquery "deps(main)"
I just added almost empty folder named pseudo_python and pointed new_local_repository to it. We don't need to execute any actual Python to see the repro.

konste on 31 Aug 2020

Thanks!

Okay, I see the same behavior: with the flag, //:tab_python_runtime_pair is in the target config (as expected), and there's no dependency on @python_linux. Without the flag, //:tab_python_runtime_pair is in the host config (also as expected), and there is a dependency on @python_linux. That's unexpected, and I'll debug a bit to try and see why.

One question I had: why are you using the flag --incompatible_override_toolchain_transition? No toolchains have actually migrated yet, and it's only intended for compatibility testing, not for users. It's possible that part of this failure is because the python rules aren't ready for this yet.

katre on 31 Aug 2020

We are not using --incompatible_override_toolchain_transition directly, but we are using --all_incompatible_changes end except for that warning everything else works just fine.

konste on 31 Aug 2020

I am not saying this warning is a blocker by any means, more like an indication of the possible bug which better be discovered early than later.

konste on 31 Aug 2020

It's definitely surprising, I'm going to keep investigating.

katre on 31 Aug 2020

Okay, added some test code to PostAnalysisQueryEnvironment to dig into skyframe and report configured targets with the same label but which aren't the same, and here we are:

WARNING: Targets were missing from graph: [ConfiguredTargetKey{label=//:tab_python_runtime_pair, config=BuildConfigurationValue.Key[285b04bb4645367b5ec24e973f3b8f58c7ba97955812264d6468841139055aed]}]
WARNING: Found similar configured target keys: [ConfiguredTargetKeyWithToolchainContext{label=//:tab_python_runtime_pair, config=BuildConfigurationValue.Key[285b04bb4645367b5ec24e973f3b8f58c7ba97955812264d6468841139055aed], toolchainContextKey=ToolchainContextKey{configurationKey=BuildConfigurationValue.Key[285b04bb4645367b5ec24e973f3b8f58c7ba97955812264d6468841139055aed], requiredToolchainTypeLabels=[@bazel_tools//tools/python:toolchain_type, @bazel_tools//tools/cpp:toolchain_type], execConstraintLabels=[], shouldSanityCheckConfiguration=false}}]

So, the issue is that, when toolchain transitions are in use, toolchains actually have ConfiguredTargetKeyWithToolchainContext (with an added toolchain context key), not a ConfiguredTargetKey (without). Unfortunately, the cquery infrastructure doesn't use the right key types, so it's not found.

Coincidentally, @juliexxia was working on this just yesterday, so it should be fixed soon.

Thanks, @konste, for bringing this to my attention.

katre on 2 Sep 2020

👍1

As far as I understand the fix is released with the version 3.7.0. Is that right?
I ask because we have switched to 3.7.0 and I still see this issue repro, although on a smaller scale than before.
If it is indeed supposed to be fixed in 3.7.0. then I will work on a repro.

konste on 22 Oct 2020

31b0453b84eb2131ff075b9022cb2d6c95e1c71a is in 3.7.0. If you're still seeing strange results, and if it's not a problem to come up with a repro, please do and we'll take a look some more.

katre on 22 Oct 2020

I have tried my original repro here and it still reproduces on both Windows and Linux. Try the command bazel cquery "deps(main)".

konste on 23 Oct 2020

@katre This issue needs to be re-activated. Hopefully, it requires just a little amendment to your existing fix and it would not take much of your time.

konste on 30 Oct 2020

@katre could we please get some traction on this issue so the fix can make it to 4.0? It produces confusing WARNING for our users which we cannot suppress and we get support tickets for that.

konste on 11 Nov 2020

I'm sorry this has been sitting for so long. I have been snowed under by other projects and this has slipped through.

Unfortunately, since the cut for 4.0 is happening shortly and I am busy with BazelCon activities, I don't know if I'll be able to debug this further and identify a fix before the cut.

katre on 12 Nov 2020

@katre Thank you for getting back to me! I realized that meeting 4.0 cut time is tough and retract my ask for it. Still, it would be great to have this issue fixed this year. We try to stay at the edge and upgrade to every monthly Bazel release, often including RCs when they contain the fixes we need, so missing 4.0 cut off is not that important as we hopefully can get the fix only about a month later.

konste on 12 Nov 2020

Okay, I've debugged this further and I can see the problem, but not the solution.

Here's the data flow:

bazel cquery deps(main) corresponds to a call to DepsFunction.eval with a ConfiguredTargetQueryEnvironment to handle the actual processing.
1. The actual data being handled by DepsFunction.eval is a generic type T, which for ConfiguredTargetQueryEnvironment is a ConfiguredTarget object.
2. ConfiguredTarget objects do not know their own ConfiguredTargetKey. This is for both memory reasons (to keep as little data in memory as possible) and correctness reasons (in theory, multiple ConfiguredTargetKey instances could resolve to the same ConfiguredTarget).
DepsFunction.eval calls into PostAnalysisQueryEnvironment.getFwdDeps. This takes a ConfiguredTarget, and finds all of the SkyKey instances that are dependencies of it.
1. For the current issue, this means it takes the ConfiguredTarget for //:main, and finds the dependency keys. This includes many types, but the one of interest is for //:tab_python_runtime_pair, and is actually ConfiguredTargetKeyWithToolchainContext{label=//:tab_python_runtime_pair, config=BuildConfigurationValue.Key[cfc863e38d78b2e0698e051b076d50722ce20dda390189a09d4b6038e48fa3a6], toolchainContextKey=ToolchainContextKey{configurationKey=BuildConfigurationValue.Key[cfc863e38d78b2e0698e051b076d50722ce20dda390189a09d4b6038e48fa3a6], requiredToolchainTypeLabels=[@bazel_tools//tools/python:toolchain_type, @bazel_tools//tools/cpp:toolchain_type], execConstraintLabels=[], shouldSanityCheckConfiguration=false}}.
2. getFwdDeps then processes the keys it discovered, locating the ConfiguredTarget values for each, and storing them as PostAnalysisQueryEnvironment.ClassifiedDependency instances. Importantly, at this point the SkyKey that was used to find the ConfiguredTarget is lost (in PostAnalysisQueryEnvironment.targetifyValues.
3. But even if that was kept in the PostAnalysisQueryEnvironment.ClassifiedDependency object, it would still be lost, because getFwdDeps has to return a Set<ConfiguredTarget> back to DepsFunction. This is because DepsFunction only knows the type the ConfiguredTargetQueryEnvironment is parameterized with, which is ConfiguredTarget.
On the next pass of DepsFunction.eval, the ConfiguredTarget for //:tab_python_runtime_pair is checked, but when ConfiguredTargetQueryEnvironment.getSkyKey is called, it returns a plain ConfiguredTargetKey, not a ConfiguredTargetKeyWithToolchainContext.
This means that the wrong key is used, and no ConfiguredTarget can be found to match, this giving the observed message.

Okay, that's a lot. The root cause here is losing the correct SkyKey instance between rounds of DepsFunction.eval. There are a few possible solutions:

Change the invariant that a ConfiguredTarget does not know its own ConfiguredTargetKey. This has the drawbacks mentioned above.
Update PostAnalysisQueryEnvironment.getFwdDeps to do a second search: if there is no target for a given key, do a looser search for any ConfiguredTarget with the same label and config key, but ignore the ToolchainContext.
1. This could conceivably work, but what if something in the build did depend directly on //:tab_python_runtime_pair? It would then be present twice, with different keys, and the wrong one might be selected.
Change ConfiguredTargetQueryEnvironment to use a different type, which retains the key. Using PostAnalysisQueryEnvironment.ClassifiedDependency might work, if it was extended to also store the key. In this way, we can be sure that successive rounds of DepsFunction.eval use the correct SkyKey for each target.

One important note: this analysis is focused on the proper handling of deps, but this could conceivably happen in any cquery function, so we need to be sure to handle it in a generic fashion for all of them.

@juliexxia, please take a look at my analysis and let me know what you think.

katre on 16 Nov 2020

👍1

Thanks for this thorough investigation @katre.

Change the invariant that a ConfiguredTarget does not know its own ConfiguredTargetKey. This has the drawbacks mentioned above.

I think this would have a lot of side effects outside of this issue (the ones you mentioned) and is probably a non-starter.

Update PostAnalysisQueryEnvironment.getFwdDeps to do a second search: if there is no target for a given key, do a looser search for any ConfiguredTarget with the same label and config key, but ignore the ToolchainContext.
This could conceivably work, but what if something in the build did depend directly on //:tab_python_runtime_pair? It would then be present twice, with different keys, and the wrong one might be selected.

Change ConfiguredTargetQueryEnvironment to use a different type, which retains the key. Using PostAnalysisQueryEnvironment.ClassifiedDependency might work, if it was extended to also store the key. In this way, we can be sure that successive rounds of DepsFunction.eval use the correct SkyKey for each target.

My intuition is that (3) is on the principle the right choice here. Though (2) maybe not actually turn up a huge problem today, it feels more like a bandaid that could lead to more issues down the road - and a tricky bandaid to get right at that. Though ConfiguredTargetKey(WithToolchainContext)s are an implementation detail that the user really doesn't need to know about, with the divergence between ConfiguredTargetKey and ConfiguredTargetKeyWithToolchainContext I think in reality they are the granularity at which cquery should operate.

@katre do you have a sense on how invasive of a change this might be? I'm not positive it has to be too too complicated if we're replacing cquery's ConfiguredTarget parameterization with some other object that is like a ConfiguredTarget+ ConfiguredTargetKey and the existing logic can be mostly kept the same with just one more layer of abstraction?

juliexxia on 18 Nov 2020

https://github.com/bazelbuild/bazel/pull/12508 is the still-in-progress PR. It manages the main issues, but there's something wrong with the alias handling that's causing test failures. Also needs a lot of documentation and polish. Currently updates about 25 files, but that will change as I do some cleanup work and fix some APIs.

Since this seems like the right approach I'll keep pushing but it's not a small or easy change.

katre on 18 Nov 2020

👍2

Okay, I have a clean-looking fix, but it requires a decent bit of cleanup. I'll send out the stages for review over the next week or so (given the break for US holidays this week).

katre on 24 Nov 2020

🚀1 🎉1

Super! Thanks @katre for a kind of unexpected ton of work! But as far as I understand this fix prevents much more serious problems in the future than the cosmetic issue I initially reported.
Which version of Bazel shall we expect to see the fix in?

konste on 25 Nov 2020

Bazel 4.1 would be the release to contain this. I'm not sure it's something we can cherrypick into 4.0, unfortunately.

katre on 25 Nov 2020

Was this page helpful?

0 / 5 - 0 ratings